Chapter 1


Mathematical Background 


The purpose of this chapter is to briefly overview the key mathematical ways of thinking that
underpin our presentation of the subject of data assimilation. In particular we touch on
the subjects of probability, dynamical systems, probability metrics and dynamical systems
for probability measures, in sections 1.1, 1.2, 1.3 and 1.4 respectively. Our treatment is
necessarily terse and very selective and the bibliography section 1.5 provides references to the
literature. We conclude the chapter with exercises.

We highlight here the fact that, throughout this book, all probability measures on $\mathbb{R}^\ell$ will
be assumed to possess a density with respect to Lebesgue measure and, furthermore, this
density will be assumed to be strictly positive everywhere in $\mathbb{R}^\ell$. This assumption greatly
simplifies our subsequent probabilistic calculations.


1.1 Probability 

We describe here some basic notation and facts from probability theory, all of which will be 
fundamental to formulating data assimilation from a probabilistic perspective. 


1.1.1. Random Variables on $\mathbb{R}^\ell$


We consider random variables $z$ on $\mathbb{R}^\ell$. To define a probability measure $\mu$ on $\mathbb{R}^\ell$ we need
to work with a sufficiently rich collection of subsets of $\mathbb{R}^\ell$, to each of which we can assign
the probability that $z$ is contained in it; this collection of subsets is termed a $\sigma$-algebra.
Throughout these notes we work with $\mathcal{B}(\mathbb{R}^\ell)$, the Borel $\sigma$-algebra generated by the open sets;
we will abbreviate this $\sigma$-algebra by $\mathcal{B}$, when the set $\mathbb{R}^\ell$ is clear. The Borel $\sigma$-algebra is the
natural collection of subsets available on $\mathbb{R}^\ell$; an element in $\mathcal{B}$ will be termed a Borel set. From
a practical viewpoint the reader of this book does not need to understand the finer properties
of the Borel $\sigma$-algebra.

We have defined a probability triple $(\mathbb{R}^\ell, \mathcal{B}, \mu)$. For simplicity we assume throughout the
book that $z$ has a strictly positive probability density function (pdf) $\rho$ with respect to Lebesgue
measure. Then, for any Borel set $A \subset \mathbb{R}^\ell$,

\[ \mathbb{P}(A) = \mathbb{P}(z \in A) = \int_A \rho(x)\,dx, \]

where $\rho : \mathbb{R}^\ell \to \mathbb{R}^+$ satisfies

\[ \int_{\mathbb{R}^\ell} \rho(x)\,dx = 1. \]

A Borel set $A \subset \mathbb{R}^\ell$ is sometimes termed an event and the event is said to occur almost surely
if $\mathbb{P}(A) = 1$. Since $\rho$ integrates to 1 over $\mathbb{R}^\ell$ and is strictly positive, this implies that the
Lebesgue measure of the complement of $A$, the set $A^c$, is zero.

We write $z \sim \mu$ as shorthand for the statement that $z$ is distributed according to probability
measure $\mu$ on $\mathbb{R}^\ell$. Note that here $\mu : \mathcal{B}(\mathbb{R}^\ell) \to [0,1]$ denotes a probability measure and
$\rho : \mathbb{R}^\ell \to \mathbb{R}^+$ the corresponding density. However, we will sometimes use the letter $\mathbb{P}$ to
denote both the measure and its corresponding pdf. This should create no confusion: $\mathbb{P}(\cdot)$
will be a probability measure whenever its argument is a Borel set, and a density whenever
its argument is a point in $\mathbb{R}^\ell$.

For any function $f : \mathbb{R}^\ell \to \mathbb{R}^{p \times q}$ we denote by $\mathbb{E}f(z)$ the expected value of the random
variable $f(z)$ on $\mathbb{R}^{p \times q}$; this expectation is given by

\[ \mathbb{E}f(z) = \int_{\mathbb{R}^\ell} f(x)\,\mu(dx), \qquad \mu(dx) = \rho(x)\,dx. \]

We also sometimes write $\mu(f)$ for $\mathbb{E}f(z)$. The case where the function $f$ is vector valued
corresponds to $q = 1$ so that $\mathbb{R}^{p \times q} = \mathbb{R}^{p \times 1} \equiv \mathbb{R}^p$. We will sometimes write $\mathbb{E}^\mu$ if we wish
to differentiate between different measures with respect to which the expectation is to be
understood.

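The characteristic function of the random variable $z$ on $\mathbb{R}^\ell$ is $\mathrm{cf} : \mathbb{R}^\ell \to \mathbb{C}$ defined by

\[ \mathrm{cf}(h) = \mathbb{E}\exp\bigl(i\langle h, z\rangle\bigr). \]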

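Example 1.1 Let $\ell = 1$ and set $\rho(x) = \frac{1}{\pi(1+x^2)}$. Then $\rho(x) > 0$ for every $x \in \mathbb{R}$. Also,
using the change of variables $x = \tan\theta$,

\[
\int_{-\infty}^{\infty} \frac{dx}{\pi(1+x^2)}
= 2\int_{0}^{\infty} \frac{dx}{\pi(1+x^2)}
= 2\int_{\arctan(0)}^{\arctan(\infty)} \frac{\sec^2\theta\,d\theta}{\pi(1+\tan^2\theta)}
= \frac{2}{\pi}\int_0^{\pi/2} d\theta = 1,
\]

and therefore $\rho$ is the pdf of a random variable $z$ on $\mathbb{R}$. We say that such a random variable has
the Cauchy distribution. ♠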
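The normalization in Example 1.1 can be checked numerically. The following is a minimal sketch, assuming a midpoint quadrature rule and the same substitution $x = \tan\theta$; the function names `cauchy_pdf` and `integrate_via_tan` are illustrative and do not come from the text.

```python
import math


def cauchy_pdf(x: float) -> float:
    """Cauchy density rho(x) = 1 / (pi * (1 + x^2))."""
    return 1.0 / (math.pi * (1.0 + x * x))


def integrate_via_tan(n: int = 10_000) -> float:
    """Midpoint rule for the integral of rho over R, after substituting x = tan(theta)."""
    a, b = -math.pi / 2.0, math.pi / 2.0
    h = (b - a) / n
    total = 0.0
    for k in range(n):
        theta = a + (k + 0.5) * h          # midpoints never hit the endpoints +-pi/2
        x = math.tan(theta)
        # dx = sec^2(theta) d(theta), so the integrand becomes rho(tan(theta)) / cos^2(theta)
        total += cauchy_pdf(x) / math.cos(theta) ** 2 * h
    return total


total_mass = integrate_via_tan()
print(total_mass)
```

After the substitution the integrand is the constant $1/\pi$ on $(-\pi/2, \pi/2)$, which is why a simple quadrature rule suffices despite the heavy tails of the density.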

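The pushforward of a pdf $\rho$ on $\mathbb{R}^\ell$ under a map $G : \mathbb{R}^\ell \to \mathbb{R}^\ell$ is denoted $G \star \rho$. It may be
calculated explicitly by means of the change of variable formula under an integral. Indeed if
$G$ is invertible then

\[ G \star \rho(v) := \rho\bigl(G^{-1}(v)\bigr)\,\bigl|DG^{-1}(v)\bigr|, \]

where $|DG^{-1}(v)|$ denotes the absolute value of the determinant of the Jacobian of $G^{-1}$ at $v$.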
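As a worked instance of the pushforward formula, the following sketch takes $\ell = 1$, the affine map $G(u) = au + b$ with $a \neq 0$, and $\rho$ the standard Gaussian density. These choices are assumptions made purely for illustration; the factor $|DG^{-1}(v)|$ is read here as the absolute value of the derivative of $G^{-1}$.

```latex
% Illustrative assumptions: \ell = 1, G(u) = a u + b with a \neq 0,
% \rho(x) = (2\pi)^{-1/2} \exp(-x^2/2).
\begin{align*}
G^{-1}(v) &= \frac{v-b}{a}, \qquad DG^{-1}(v) = \frac{1}{a}, \\
(G \star \rho)(v) &= \rho\Bigl(\frac{v-b}{a}\Bigr)\,\frac{1}{|a|}
  = \frac{1}{\sqrt{2\pi a^2}}\exp\Bigl(-\frac{(v-b)^2}{2a^2}\Bigr).
\end{align*}
```

The result is the density of $N(b, a^2)$, in agreement with the affine transformation rule for Gaussians stated in Lemma 1.5.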

We will occasionally use the Markov inequality which states that, for a random variable $z$
on $\mathbb{R}^\ell$, and any $R > 0$,

\[ \mathbb{P}(|z| \ge R) \le R^{-1}\,\mathbb{E}|z|. \tag{1.1} \]

As a consequence

\[ \mathbb{P}(|z| < R) \ge 1 - R^{-1}\,\mathbb{E}|z|. \tag{1.2} \]

In particular, if $\mathbb{E}|z| < \infty$, then choosing $R$ sufficiently large shows that $\mathbb{P}(|z| < R) > 0$. In
our setting this last inequality follows in any case, by assumption on the strict positivity of
$\rho(\cdot)$ everywhere on $\mathbb{R}^\ell$.

Finally we say that a sequence of probability measures $\mu^{(n)}$ on $\mathbb{R}^\ell$ converges
weakly to a limiting probability measure $\mu$ on $\mathbb{R}^\ell$ if, for all continuous bounded functions
$\varphi : \mathbb{R}^\ell \to \mathbb{R}$,

\[ \mathbb{E}^{\mu^{(n)}}\varphi(u) \to \mathbb{E}^{\mu}\varphi(u) \]

as $n \to \infty$.

1.1.2. Gaussian Random Variables


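We work in finite dimensions, but all the ideas can be generalized to infinite dimensional
contexts, such as the Hilbert space setting, for example. A Gaussian random variable^1 on
$\mathbb{R}^\ell$ is characterized by:

• Mean: $m \in \mathbb{R}^\ell$.

• Covariance: $C \in \mathbb{R}^{\ell \times \ell}$, symmetric, $C \ge 0$.

We write $z \sim N(m, C)$ and call the Gaussian random variable centred if $m = 0$. If $C > 0$ then
$z$ has strictly positive pdf on $\mathbb{R}^\ell$ given by

\[
\begin{aligned}
\rho(x) &= \frac{1}{(2\pi)^{\ell/2}(\det C)^{1/2}} \exp\Bigl(-\frac12 \bigl|C^{-1/2}(x-m)\bigr|^2\Bigr) && \text{(1.3a)} \\
        &= \frac{1}{(2\pi)^{\ell/2}(\det C)^{1/2}} \exp\Bigl(-\frac12 |x-m|_C^2\Bigr). && \text{(1.3b)}
\end{aligned}
\]

It can be shown that indeed $\rho$ given by (1.3) satisfies

\[ \int_{\mathbb{R}^\ell} \rho(x)\,dx = 1. \tag{1.4} \]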


Lemma 1.2 Let $z \sim N(m, C)$, $C > 0$. Then


1. $\mathbb{E}z = m$.

2. $\mathbb{E}(z - m)(z - m)^T = C$.


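Proof For the first item

\[
\begin{aligned}
\mathbb{E}z &= \frac{1}{(2\pi)^{\ell/2}(\det C)^{1/2}} \int_{\mathbb{R}^\ell} x \exp\Bigl(-\frac12|x-m|_C^2\Bigr)\,dx \\
&= \frac{1}{(2\pi)^{\ell/2}(\det C)^{1/2}} \int_{\mathbb{R}^\ell} (y+m) \exp\Bigl(-\frac12|y|_C^2\Bigr)\,dy \\
&= \frac{1}{(2\pi)^{\ell/2}(\det C)^{1/2}} \int_{\mathbb{R}^\ell} y \exp\Bigl(-\frac12|y|_C^2\Bigr)\,dy
   + m\,\frac{1}{(2\pi)^{\ell/2}(\det C)^{1/2}} \int_{\mathbb{R}^\ell} \exp\Bigl(-\frac12|y|_C^2\Bigr)\,dy \\
&= 0 + m \\
&= m,
\end{aligned}
\]

where we used in the last line that the function $y \mapsto y\exp(-\frac12|y|_C^2)$ is odd, so that its
integral over $\mathbb{R}^\ell$ vanishes, and the fact that, by (1.4),

\[ \frac{1}{(2\pi)^{\ell/2}(\det C)^{1/2}} \int_{\mathbb{R}^\ell} \exp\Bigl(-\frac12|y|_C^2\Bigr)\,dy = 1. \]

For the second item

\[
\begin{aligned}
\mathbb{E}(z-m)(z-m)^T &= \frac{1}{(2\pi)^{\ell/2}(\det C)^{1/2}} \int_{\mathbb{R}^\ell} (x-m)(x-m)^T \exp\Bigl(-\frac12|x-m|_C^2\Bigr)\,dx \\
&= \frac{1}{(2\pi)^{\ell/2}(\det C)^{1/2}} \int_{\mathbb{R}^\ell} y y^T \exp\Bigl(-\frac12\bigl|C^{-1/2}y\bigr|^2\Bigr)\,dy \\
&= \frac{1}{(2\pi)^{\ell/2}(\det C)^{1/2}} \int_{\mathbb{R}^\ell} C^{1/2} w w^T C^{1/2} \exp\Bigl(-\frac12|w|^2\Bigr) \det(C^{1/2})\,dw \\
&= C^{1/2} J C^{1/2}
\end{aligned}
\]

where

\[ J = \frac{1}{(2\pi)^{\ell/2}} \int_{\mathbb{R}^\ell} w w^T \exp\Bigl(-\frac12|w|^2\Bigr)\,dw \in \mathbb{R}^{\ell \times \ell} \]

and so

\[ J_{ij} = \frac{1}{(2\pi)^{\ell/2}} \int_{\mathbb{R}^\ell} w_i w_j \exp\Bigl(-\frac12\sum_{k=1}^{\ell} w_k^2\Bigr) \prod_{k=1}^{\ell} dw_k. \]

To complete the proof we need to show that $J$ is the identity matrix $I$ on $\mathbb{R}^\ell \times \mathbb{R}^\ell$. Indeed,
for $i \ne j$

\[ J_{ij} \propto \int_{\mathbb{R}} w_i \exp\Bigl(-\frac12 w_i^2\Bigr)\,dw_i \int_{\mathbb{R}} w_j \exp\Bigl(-\frac12 w_j^2\Bigr)\,dw_j = 0, \]

by symmetry; and for $i = j$

\[
\begin{aligned}
J_{jj} &= \frac{1}{(2\pi)^{1/2}} \int_{\mathbb{R}} w_j^2 \exp\Bigl(-\frac12 w_j^2\Bigr)\,dw_j
          \,\Bigl(\frac{1}{(2\pi)^{1/2}} \int_{\mathbb{R}} \exp\Bigl(-\frac12 w_k^2\Bigr)\,dw_k\Bigr)^{\ell-1} \\
&= \frac{1}{(2\pi)^{1/2}} \int_{\mathbb{R}} w_j^2 \exp\Bigl(-\frac12 w_j^2\Bigr)\,dw_j \\
&= -\frac{1}{(2\pi)^{1/2}} \Bigl[w_j \exp\Bigl(-\frac12 w_j^2\Bigr)\Bigr]_{-\infty}^{\infty}
   + \frac{1}{(2\pi)^{1/2}} \int_{\mathbb{R}} \exp\Bigl(-\frac12 w_j^2\Bigr)\,dw_j = 1,
\end{aligned}
\]

where we again used (1.4) in the first and last lines. Thus $J = I$, the identity in $\mathbb{R}^\ell$, and
$\mathbb{E}(z-m)(z-m)^T = C^{1/2}C^{1/2} = C$. □

^1 Sometimes also called a normal random variable.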

The following characterization of Gaussians is often useful. 


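Lemma 1.3 The characteristic function of the Gaussian $N(m,C)$ is given by

\[ \mathrm{cf}(h) = \exp\Bigl(i\langle h, m\rangle - \frac12\langle Ch, h\rangle\Bigr). \]

Proof This follows from noting that

\[ \frac12|x-m|_C^2 - i\langle h, x\rangle = \frac12\bigl|x - (m + iCh)\bigr|_C^2 - i\langle h, m\rangle + \frac12\langle Ch, h\rangle. \]

□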


Remark 1.4 Note that the pdf for the Gaussian random variable that we wrote down in
equation (1.3) is defined only for $C > 0$ since it involves $C^{-1}$. The characteristic function
appearing in the preceding lemma can be used to define a Gaussian with mean $m$ and covariance
$C$, including the case where $C \ge 0$ so that the Gaussian covariance $C$ is only positive
semi-definite, since it is defined in terms of $C$ and not $C^{-1}$. For example if we let $z \sim N(m,C)$
with $C = 0$ then $z$ is a Dirac mass at $m$, i.e. $z = m$ almost surely and for any continuous
function $f$

\[ \mathbb{E}f(z) = f(m). \]

This Dirac mass may be viewed as a particular case of a Gaussian random variable. We will
write $\delta_m$ for $N(m,0)$.

Lemma 1.5 The following hold for Gaussian random variables:

• If $z = a_1 z_1 + a_2 z_2$ where $z_1, z_2$ are independent Gaussians with distributions $N(m_1, C_1)$
and $N(m_2, C_2)$ respectively then $z$ is Gaussian with distribution
$N(a_1 m_1 + a_2 m_2,\, a_1^2 C_1 + a_2^2 C_2)$.

• If $z \sim N(m, C)$ and $w = Lz + a$ then $w \sim N(Lm + a, LCL^T)$.

Proof The first result follows from computing the characteristic function of $z$. By independence
this is the product of the characteristic functions of $a_1 z_1$ and of $a_2 z_2$. The characteristic
function of $a_i z_i$ has logarithm equal to

\[ i\langle h, a_i m_i\rangle - \frac12\langle a_i^2 C_i h, h\rangle. \]

Adding this for $i = 1, 2$ gives the logarithm of the characteristic function of $z$.

For the second result we note that the characteristic function of $a + Lz$ is the expectation
of the exponential of

\[ i\langle h, a + Lz\rangle = i\langle h, a\rangle + i\langle L^T h, z\rangle. \]

Using the properties of the characteristic functions of $z$ we deduce that the logarithm of the
characteristic function of $a + Lz$ is equal to

\[ i\langle h, a\rangle + i\langle L^T h, m\rangle - \frac12\langle C L^T h, L^T h\rangle. \]

This may be re-written as

\[ i\langle h, a + Lm\rangle - \frac12\langle LCL^T h, h\rangle \]

which is the logarithm of the characteristic function of $N(a + Lm, LCL^T)$ as required. □
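The first statement of Lemma 1.5 can be checked by sampling in the scalar case. The sketch below is a Monte Carlo illustration only, not part of the proof; the coefficients, means, variances, sample size and seed are arbitrary assumptions, and `sample_linear_combination` is a hypothetical helper name.

```python
import random


def sample_linear_combination(a1, a2, m1, c1, m2, c2, n, seed=0):
    """Draw n samples of z = a1*z1 + a2*z2 with independent z1 ~ N(m1, c1), z2 ~ N(m2, c2)."""
    rng = random.Random(seed)
    # random.gauss expects a standard deviation, hence the square roots of the variances.
    return [
        a1 * rng.gauss(m1, c1 ** 0.5) + a2 * rng.gauss(m2, c2 ** 0.5)
        for _ in range(n)
    ]


a1, a2 = 2.0, -1.0
m1, c1 = 1.0, 0.5
m2, c2 = 3.0, 2.0

samples = sample_linear_combination(a1, a2, m1, c1, m2, c2, n=100_000)

# Mean and covariance predicted by the lemma (scalar case).
predicted_mean = a1 * m1 + a2 * m2
predicted_var = a1 ** 2 * c1 + a2 ** 2 * c2

empirical_mean = sum(samples) / len(samples)
empirical_var = sum((s - empirical_mean) ** 2 for s in samples) / len(samples)
print(predicted_mean, empirical_mean, predicted_var, empirical_var)
```

Agreement of the first two moments does not by itself show that $z$ is Gaussian; that part of the lemma rests on the characteristic function argument in the proof.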
We finish by stating a lemma whose proof is straightforward, given the foregoing material 
in this section, and left as an exercise. 

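Lemma 1.6 Define

\[ I(v) := \frac12\bigl\langle (v - m), L(v - m)\bigr\rangle \]

with $L \in \mathbb{R}^{\ell \times \ell}$ symmetric and satisfying $L > 0$, and $m \in \mathbb{R}^\ell$. Then $\exp(-I(v))$ can be normalized to produce
the pdf of the Gaussian random variable $N(m, L^{-1})$ on $\mathbb{R}^\ell$. The matrix $L$ is known as the
precision matrix of the Gaussian random variable.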

1.1.3. Conditional and Marginal Distributions 

Let $(a, b) \in \mathbb{R}^\ell \times \mathbb{R}^m$ denote a jointly varying random variable.

Definition 1.7 The marginal pdf of $a$, $\mathbb{P}(a)$, is given in terms of the pdf of $(a, b)$, $\mathbb{P}(a, b)$, by

\[ \mathbb{P}(a) = \int_{\mathbb{R}^m} \mathbb{P}(a,b)\,db. \]

Remark 1.8 With this definition, for $A \in \mathcal{B}(\mathbb{R}^\ell)$,

\[
\begin{aligned}
\mathbb{P}(a \in A) = \mathbb{P}\bigl((a,b) \in A \times \mathbb{R}^m\bigr) &= \int_A \int_{\mathbb{R}^m} \mathbb{P}(a,b)\,da\,db \\
&= \int_A \Bigl(\int_{\mathbb{R}^m} \mathbb{P}(a,b)\,db\Bigr)\,da = \int_A \mathbb{P}(a)\,da.
\end{aligned}
\]

Thus the marginal pdf $\mathbb{P}(a)$ is indeed the pdf for $a$ in situations where we have no information
about the random variable $b$, other than that it is in $\mathbb{R}^m$.

We now consider the situation which is the extreme opposite of the marginal situation.
To be precise, we assume that we know everything about the random variable $b$: we have
observed it and know what value it takes. This leads to consideration of the random variable
$a$ given that we know the value taken by $b$; we write $a|b$ for $a$ given $b$. The following definition
is then natural:


Definition 1.9 The conditional pdf of $a|b$, $\mathbb{P}(a|b)$, is defined by

\[ \mathbb{P}(a|b) = \frac{\mathbb{P}(a,b)}{\mathbb{P}(b)}. \tag{1.5} \]


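Remark 1.10 Conditioning a jointly varying random variable can be useful when computing
probabilities, as the following calculation demonstrates.

\[
\begin{aligned}
\mathbb{P}\bigl((a,b) \in A \times B\bigr) &= \int_A \int_B \mathbb{P}(a,b)\,da\,db \\
&= \int_A \int_B \mathbb{P}(a|b)\,\mathbb{P}(b)\,da\,db \\
&= \underbrace{\int_B \underbrace{\Bigl(\int_A \mathbb{P}(a|b)\,da\Bigr)}_{=: I_1} \mathbb{P}(b)\,db}_{=: I_2}.
\end{aligned}
\]

Given $b$, $I_1$ computes the probability that $a$ is in $A$. $I_2$ then denotes averaging over given
outcomes of $b$ in $B$. ♠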


1.1.4. Bayes’ Formula 

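By Definition 1.9 we have

\[
\begin{aligned}
\mathbb{P}(a,b) &= \mathbb{P}(a|b)\,\mathbb{P}(b), && \text{(1.6a)} \\
\mathbb{P}(a,b) &= \mathbb{P}(b|a)\,\mathbb{P}(a). && \text{(1.6b)}
\end{aligned}
\]

Equating and rearranging we obtain Bayes’ formula which states that

\[ \mathbb{P}(a|b) = \frac{1}{\mathbb{P}(b)}\,\mathbb{P}(b|a)\,\mathbb{P}(a). \tag{1.7} \]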

The beauty of this formula is apparent in situations where P(a) and P(6|a) are individually 
easy to write down. Then P(a|6) may be identified easily too. 

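Example 1.11 Let $(a, b) \in \mathbb{R} \times \mathbb{R}$ be a jointly varying random variable specified via

\[
\begin{aligned}
a &\sim N(m, \sigma^2), && \mathbb{P}(a); \\
b|a &\sim N(f(a), \gamma^2), && \mathbb{P}(b|a).
\end{aligned}
\]

Notice that, by using equation (1.5), $\mathbb{P}(a, b)$ is defined via two Gaussian distributions. In fact
we have

\[ \mathbb{P}(a,b) = \frac{1}{2\pi\gamma\sigma} \exp\Bigl(-\frac{1}{2\gamma^2}|b - f(a)|^2 - \frac{1}{2\sigma^2}|a - m|^2\Bigr). \]

Unless $f(\cdot)$ is linear this is not the pdf of a Gaussian distribution. Integrating over $a$ we
obtain, from the definition of the marginal pdf of $b$,

\[ \mathbb{P}(b) = \frac{1}{2\pi\gamma\sigma} \int_{\mathbb{R}} \exp\Bigl(-\frac{1}{2\gamma^2}|b - f(a)|^2 - \frac{1}{2\sigma^2}|a - m|^2\Bigr)\,da. \]

Using equation (1.7) then shows that

\[ \mathbb{P}(a|b) = \frac{1}{\mathbb{P}(b)} \times \frac{1}{2\pi\gamma\sigma} \exp\Bigl(-\frac{1}{2\gamma^2}|b - f(a)|^2 - \frac{1}{2\sigma^2}|a - m|^2\Bigr). \]

Note that $a|b$, like $(a,b)$, is not Gaussian. Thus, for both $(a,b)$ and $a|b$, we have constructed
a non-Gaussian pdf in a simple fashion from the knowledge of the two Gaussians $a$ and
$b|a$.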

When Bayes’ formula (1.7) is used in statistics then typically $b$ is observed data and $a$ is
the unknown about which we wish to find information, using the data. In this context we
refer to $\mathbb{P}(a)$ as the prior, to $\mathbb{P}(b|a)$ as the likelihood and to $\mathbb{P}(a|b)$ as the posterior. The
beauty of Bayes’ formula as a tool in applied mathematics is that the likelihood is often easy
to determine explicitly, given reasonable assumptions on the observational noise, whilst there
is considerable flexibility inherent in modelling prior knowledge via probabilities, to give the
prior. Combining the prior and likelihood as in (1.7) gives the posterior, which is the random
variable of interest; whilst the probability distributions used to define the likelihood $\mathbb{P}(b|a)$
(via a probability density on the data space) and prior $\mathbb{P}(a)$ (via a probability on the space of
unknowns) may be quite simple, the resulting posterior probability distribution can be very
complicated. A second key point to note about Bayes’ formula in this context is that $\mathbb{P}(b)$,
which normalizes the posterior to a pdf, may be hard to determine explicitly, but algorithms
exist to find information from the posterior without knowing this normalization constant. We
return to this point in subsequent chapters.
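The passage from prior and likelihood to posterior can be sketched numerically in one dimension. The code below is an illustration under stated assumptions: a Gaussian prior, a Gaussian likelihood centred at $f(a)$, the hypothetical choice $f(a) = a^2$, and a Riemann sum on a uniform grid standing in for the normalization integral. None of the parameter values or function names come from the text.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def posterior_on_grid(b, f, m, sigma, gamma, grid):
    """Bayes' formula on a uniform grid: posterior = likelihood * prior / P(b)."""
    h = grid[1] - grid[0]
    unnormalized = [gaussian_pdf(b, f(a), gamma) * gaussian_pdf(a, m, sigma) for a in grid]
    evidence = h * sum(unnormalized)      # Riemann-sum approximation of P(b)
    return [u / evidence for u in unnormalized], evidence


def f(a):
    """Hypothetical nonlinear map, chosen only for illustration."""
    return a ** 2


grid = [-5.0 + 0.01 * k for k in range(1001)]
posterior, evidence = posterior_on_grid(b=1.0, f=f, m=0.0, sigma=1.0, gamma=0.5, grid=grid)
mass = (grid[1] - grid[0]) * sum(posterior)
print(mass, evidence)
```

With this choice of $f$ the posterior has two symmetric modes, one on each side of the origin, so it is visibly non-Gaussian even though prior and likelihood are both Gaussian. The grid approach is practical only in low dimensions.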

1.1.5. Independence 

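Consider the jointly varying random variable $(a, b) \in \mathbb{R}^\ell \times \mathbb{R}^m$. The random variables $a$ and
$b$ are said to be independent if

\[ \mathbb{P}(a,b) = \mathbb{P}(a)\,\mathbb{P}(b). \]

In this case, for $f : \mathbb{R}^\ell \to \mathbb{R}^{\ell'}$ and $g : \mathbb{R}^m \to \mathbb{R}^{m'}$,

\[ \mathbb{E}f(a)g(b)^T = \bigl(\mathbb{E}f(a)\bigr) \times \bigl(\mathbb{E}g(b)^T\bigr) \]

as

\[ \mathbb{E}f(a)g(b)^T = \int_{\mathbb{R}^\ell \times \mathbb{R}^m} f(a)g(b)^T\,\mathbb{P}(a)\mathbb{P}(b)\,da\,db = \Bigl(\int_{\mathbb{R}^\ell} f(a)\mathbb{P}(a)\,da\Bigr)\Bigl(\int_{\mathbb{R}^m} g(b)^T\mathbb{P}(b)\,db\Bigr). \]

An i.i.d. (independent, identically distributed) sequence $\{\xi_j\}_{j \in \mathbb{N}}$ is one for which:^2

• each $\xi_j$ is distributed according to the same pdf $\rho$;

• $\xi_j$ is independent of $\xi_k$ for $j \ne k$.

If $\mathcal{J}$ is a subset of $\mathbb{N}$ with finite cardinality then this i.i.d. sequence satisfies

\[ \mathbb{P}\bigl(\{\xi_j\}_{j \in \mathcal{J}}\bigr) = \prod_{j \in \mathcal{J}} \rho(\xi_j). \]

^2 This discussion is easily generalized to $j \in \mathbb{Z}^+$.

1.2 Dynamical Systems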


We will discuss data assimilation in the context of both discrete-time and continuous-time 
dynamical systems. In this section we introduce some basic facts about such dynamical 
systems. 


1.2.1. Iterated Maps 

Let $\Psi \in C(\mathbb{R}^\ell, \mathbb{R}^\ell)$. We will frequently be interested in the iterated map, or discrete-time
dynamical system, defined by

\[ v_{j+1} = \Psi(v_j), \qquad v_0 = u, \]

and in studying properties of the sequence $\{v_j\}_{j \in \mathbb{Z}^+}$. A fixed point of the map is a point $v_\infty$
which satisfies $v_\infty = \Psi(v_\infty)$; initializing the map at $u = v_\infty$ will result in a sequence satisfying
$v_j = v_\infty$ for all $j \in \mathbb{Z}^+$.


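Example 1.12 Let

\[ \Psi(v) = \lambda v + a. \]

Then

\[ v_{j+1} = \lambda v_j + a, \qquad v_0 = u. \]

By induction we see that, for $\lambda \ne 1$,

\[ v_j = \lambda^j u + a\sum_{i=0}^{j-1} \lambda^i = \lambda^j u + a\,\frac{1-\lambda^j}{1-\lambda}. \]

Thus if $|\lambda| < 1$ then

\[ v_j \to \frac{a}{1-\lambda} \quad \text{as } j \to \infty. \]

The limiting value $\frac{a}{1-\lambda}$ is a fixed point of the map. ♠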
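The closed form in Example 1.12 can be compared with direct iteration of the map. This is a minimal sketch; the parameter values $\lambda = 0.5$, $a = 1$, $u = 10$ and the helper names are assumptions chosen for illustration.

```python
def iterate_affine_map(lam, a, u, steps):
    """Iterate v_{j+1} = lam * v_j + a from v_0 = u and return [v_0, ..., v_steps]."""
    v = u
    trajectory = [v]
    for _ in range(steps):
        v = lam * v + a
        trajectory.append(v)
    return trajectory


def closed_form(lam, a, u, j):
    """v_j = lam^j u + a (1 - lam^j) / (1 - lam), valid for lam != 1."""
    return lam ** j * u + a * (1.0 - lam ** j) / (1.0 - lam)


lam, a, u = 0.5, 1.0, 10.0
trajectory = iterate_affine_map(lam, a, u, steps=60)
fixed_point = a / (1.0 - lam)
print(trajectory[-1], fixed_point)
```

Because $|\lambda| < 1$ here, the distance to the fixed point shrinks by the factor $\lambda$ at every step.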

Remark 1.13 In the preceding example the long-term dynamics of the map, for $|\lambda| < 1$, is
described by convergence to a fixed point. Far more complex behaviour is, of course, possible;
we will explore such complex behaviour in the next chapter. ♠


The following result is known as the (discrete time) Gronwall lemma. 


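Lemma 1.14 Let $\{v_j\}_{j \in \mathbb{Z}^+}$ be a positive sequence and $(\lambda, a)$ a pair of reals with $\lambda > 0$. Then
if

\[ v_{j+1} \le \lambda v_j + a, \qquad j = 0, 1, \dots \]

it follows that

\[ v_j \le \lambda^j v_0 + a\,\frac{1-\lambda^j}{1-\lambda}, \qquad \lambda \ne 1, \]

and

\[ v_j \le v_0 + ja, \qquad \lambda = 1. \]

Proof We prove the case $\lambda \ne 1$ as the case $\lambda = 1$ may be proved similarly. We proceed by
induction. The result clearly holds for $j = 0$. Assume that the result is true for $j = J$. Then

\[
\begin{aligned}
v_{J+1} &\le \lambda v_J + a \\
&\le \lambda\Bigl(\lambda^J v_0 + a\,\frac{1-\lambda^J}{1-\lambda}\Bigr) + a \\
&= \lambda^{J+1} v_0 + a\,\frac{\lambda-\lambda^{J+1}}{1-\lambda} + a\,\frac{1-\lambda}{1-\lambda} \\
&= \lambda^{J+1} v_0 + a\,\frac{1-\lambda^{J+1}}{1-\lambda}.
\end{aligned}
\]

This establishes the inductive step and the proof is complete. □

We will also be interested in stochastic dynamical systems of the form

\[ v_{j+1} = \Psi(v_j) + \xi_j, \qquad v_0 = u, \]

where $\xi = \{\xi_j\}_{j \in \mathbb{N}}$ is an i.i.d. sequence of random variables on $\mathbb{R}^\ell$, and $u$ is a random variable
on $\mathbb{R}^\ell$, independent of $\xi$.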

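Example 1.15 This is a simple but important one dimensional (i.e. $\ell = 1$) example. Let
$|\lambda| < 1$ and let

\[
\begin{aligned}
v_{j+1} &= \lambda v_j + \xi_j, \qquad \xi_j \sim N(0, \sigma^2) \ \text{i.i.d.}, \\
v_0 &\sim N(m_0, \sigma_0^2).
\end{aligned}
\]

By induction

\[ v_j = \lambda^j v_0 + \sum_{i=0}^{j-1} \lambda^{j-i-1}\xi_i. \]

Thus $v_j$ is Gaussian, as a linear transformation of Gaussians - see Lemma 1.5. Furthermore,
using independence of the initial condition from the sequence $\xi$, we obtain

\[
\begin{aligned}
m_j &:= \mathbb{E}v_j = \lambda^j m_0, \\
\sigma_j^2 &:= \mathbb{E}(v_j - m_j)^2 = \lambda^{2j}\,\mathbb{E}(v_0 - m_0)^2 + \sum_{i=0}^{j-1} \lambda^{2j-2i-2}\sigma^2 \\
&= \lambda^{2j}\sigma_0^2 + \sigma^2\sum_{i=0}^{j-1} \lambda^{2i} \\
&= \lambda^{2j}\sigma_0^2 + \sigma^2\,\frac{1-\lambda^{2j}}{1-\lambda^2}.
\end{aligned}
\]

Since $|\lambda| < 1$ we deduce that $m_j \to 0$ and $\sigma_j^2 \to \sigma^2(1-\lambda^2)^{-1}$. Thus the sequence of Gaussians
generated by this stochastic dynamical system has a limit, which is a centred Gaussian with
variance larger than the variance of $\xi_1$, unless $\lambda = 0$.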
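The evolution of the mean and variance in Example 1.15 can also be followed one step at a time. Taking expectations in $v_{j+1} = \lambda v_j + \xi_j$ and using independence gives the one-step recursions $m_{j+1} = \lambda m_j$ and $\sigma_{j+1}^2 = \lambda^2\sigma_j^2 + \sigma^2$. The sketch below iterates them and compares with the closed form; the parameter values and function names are illustrative assumptions.

```python
def propagate_moments(lam, sigma2, m0, sigma0_2, steps):
    """Iterate m_{j+1} = lam * m_j and s_{j+1} = lam^2 * s_j + sigma2; return [(m_j, s_j)]."""
    m, s2 = m0, sigma0_2
    history = [(m, s2)]
    for _ in range(steps):
        m = lam * m
        s2 = lam ** 2 * s2 + sigma2
        history.append((m, s2))
    return history


def closed_form_variance(lam, sigma2, sigma0_2, j):
    """lam^{2j} sigma_0^2 + sigma^2 (1 - lam^{2j}) / (1 - lam^2)."""
    return lam ** (2 * j) * sigma0_2 + sigma2 * (1.0 - lam ** (2 * j)) / (1.0 - lam ** 2)


lam, sigma2, m0, sigma0_2 = 0.9, 1.0, 5.0, 4.0
history = propagate_moments(lam, sigma2, m0, sigma0_2, steps=200)
limit_variance = sigma2 / (1.0 - lam ** 2)
print(history[-1], limit_variance)
```

Two numbers per step describe the whole distribution here only because every $v_j$ is Gaussian.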


1.2.2. Differential Equations 

Let $f \in C^1(\mathbb{R}^\ell, \mathbb{R}^\ell)$ and consider the ordinary differential equation (ODE)

\[ \frac{dv}{dt} = f(v), \qquad v(0) = u. \]

Assume a solution exists for all $u \in \mathbb{R}^\ell$, $t \in \mathbb{R}^+$; for any given $u$ this solution is then an element
of the space $C^1(\mathbb{R}^+; \mathbb{R}^\ell)$. In this situation, the ODE generates a continuous-time dynamical
system. We are interested in properties of the function $v$. An equilibrium point $v_\infty \in \mathbb{R}^\ell$ is a
point for which $f(v_\infty) = 0$. Initializing the equation at $u = v_\infty$ results in a solution $v(t) = v_\infty$
for all $t \ge 0$.


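Example 1.16 Let $f(v) = -\alpha v + \beta$. Then

\[ e^{\alpha t}\Bigl(\frac{dv}{dt} + \alpha v\Bigr) = \beta e^{\alpha t} \]

and so

\[ \frac{d}{dt}\bigl(e^{\alpha t}v\bigr) = \frac{d}{dt}\Bigl(\frac{\beta}{\alpha}e^{\alpha t}\Bigr). \]

Thus

\[ e^{\alpha t}v(t) - u = \frac{\beta}{\alpha}\bigl(e^{\alpha t} - 1\bigr) \]

so that

\[ v(t) = e^{-\alpha t}u + \frac{\beta}{\alpha}\bigl(1 - e^{-\alpha t}\bigr). \]

If $\alpha > 0$ then

\[ v(t) \to \frac{\beta}{\alpha} \quad \text{as } t \to \infty. \]

Note that $v_\infty := \frac{\beta}{\alpha}$ is the unique equilibrium point of the equation.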




Remark 1.17 In the preceding example the long-term dynamics of the ODE, for $\alpha > 0$, is
described by convergence to an equilibrium point. As in discrete time, far more complex
behaviour is, of course, possible; we will explore this possibility in the next chapter.

If the differential equation has a solution for every $u \in \mathbb{R}^\ell$ and every $t \in \mathbb{R}^+$ then there is a
one-parameter semigroup of operators $\Psi(\cdot\,; t)$, parametrized by time $t \ge 0$, with the properties
that

\[
\begin{aligned}
v(t) &= \Psi(u; t), && t \in (0, \infty), && \text{(1.10a)} \\
\Psi(u; t+s) &= \Psi\bigl(\Psi(u; s); t\bigr), && t, s \in \mathbb{R}^+,\ u \in \mathbb{R}^\ell, && \text{(1.10b)} \\
\Psi(u; 0) &= u \in \mathbb{R}^\ell. && && \text{(1.10c)}
\end{aligned}
\]

We call $\Psi(\cdot\,;\cdot)$ the solution operator for the ODE. In this scenario we can consider the iterated
map defined by $\Psi(\cdot) = \Psi(\cdot\,; h)$, for some fixed $h > 0$, thereby linking the discrete time iterated
maps with continuous time ODEs.

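Example 1.18 (Example 1.16 continued) Let

\[ \Psi(u; t) = e^{-\alpha t}u + \frac{\beta}{\alpha}\bigl(1 - e^{-\alpha t}\bigr), \]

which is the solution operator for the equation in that $v(t) = \Psi(u; t)$. Clearly $\Psi(u; 0) = u$.
Also

\[
\begin{aligned}
\Psi(u; t+s) &= e^{-\alpha t}e^{-\alpha s}u + \frac{\beta}{\alpha}\bigl(1 - e^{-\alpha t}e^{-\alpha s}\bigr) \\
&= e^{-\alpha t}\Bigl(e^{-\alpha s}u + \frac{\beta}{\alpha}\bigl(1 - e^{-\alpha s}\bigr)\Bigr) + \frac{\beta}{\alpha}\bigl(1 - e^{-\alpha t}\bigr) \\
&= \Psi\bigl(\Psi(u; s); t\bigr).
\end{aligned}
\]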
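The semigroup property of the solution operator for the linear ODE $dv/dt = -\alpha v + \beta$ can be verified numerically. This is a minimal sketch with arbitrarily chosen values of $\alpha$, $\beta$, $u$, $s$ and $t$ (all assumptions); `solution_operator` is an illustrative name for $\Psi(u; t)$.

```python
import math


def solution_operator(u, t, alpha, beta):
    """Psi(u; t) = exp(-alpha t) u + (beta / alpha) (1 - exp(-alpha t)), for alpha != 0."""
    decay = math.exp(-alpha * t)
    return decay * u + (beta / alpha) * (1.0 - decay)


alpha, beta, u = 2.0, 3.0, -1.0
s, t = 1.3, 0.7

direct = solution_operator(u, t + s, alpha, beta)                                # Psi(u; t + s)
composed = solution_operator(solution_operator(u, s, alpha, beta), t, alpha, beta)  # Psi(Psi(u; s); t)
long_time = solution_operator(u, 50.0, alpha, beta)                             # approaches beta / alpha
print(direct, composed, long_time)
```

Fixing a step $h > 0$ and applying `solution_operator(v, h, alpha, beta)` repeatedly gives the iterated map $\Psi(\cdot) = \Psi(\cdot\,; h)$ associated with the ODE.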
The following result is known as the (continuous time) Gronwall lemma.


Lemma 1.19 Let $z \in C^1(\mathbb{R}^+, \mathbb{R})$ satisfy

\[ \frac{dz}{dt} \le az + b, \qquad z(0) = z_0, \]

for some $a, b \in \mathbb{R}$. Then

\[ z(t) \le e^{at}z_0 + \frac{b}{a}\bigl(e^{at} - 1\bigr). \]

Proof Multiplying both sides of the given inequality by $e^{-at}$ we obtain

\[ e^{-at}\Bigl(\frac{dz}{dt} - az\Bigr) \le b e^{-at} \]

which implies that

\[ \frac{d}{dt}\bigl(e^{-at}z\bigr) \le b e^{-at}. \]

Therefore,

\[ e^{-at}z(t) - z(0) \le \frac{b}{a}\bigl(1 - e^{-at}\bigr) \]

so that

\[ z(t) \le e^{at}z_0 + \frac{b}{a}\bigl(e^{at} - 1\bigr). \]

□


1.2.3. Long-Time Behaviour 

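We consider the long-time behaviour of discrete-time dynamical systems. The ideas are
easily generalized to continuous-time dynamical systems - ODEs - and indeed our example
will demonstrate such a generalization. To facilitate our definitions we now extend $\Psi$ to act
on Borel subsets of $\mathbb{R}^\ell$. Note that currently $\Psi : \mathbb{R}^\ell \to \mathbb{R}^\ell$; we extend to $\Psi : \mathcal{B}(\mathbb{R}^\ell) \to \mathcal{B}(\mathbb{R}^\ell)$
via

\[ \Psi(A) = \bigcup_{u \in A} \Psi(u), \qquad A \in \mathcal{B}(\mathbb{R}^\ell). \]

For both $\Psi : \mathbb{R}^\ell \to \mathbb{R}^\ell$ and $\Psi : \mathcal{B}(\mathbb{R}^\ell) \to \mathcal{B}(\mathbb{R}^\ell)$ we denote by

\[ \Psi^{(j)} = \Psi \circ \cdots \circ \Psi \]

the $j$-fold composition of $\Psi$ with itself. In the following, let $B(0, R)$ denote the ball of radius
$R$ in $\mathbb{R}^\ell$, in the Euclidean norm, centred at the origin.

Definition 1.20 A discrete time dynamical system has a bounded absorbing set $B_{\rm abs} \subset \mathbb{R}^\ell$
if, for every $R > 0$, there exists $J = J(R)$ such that

\[ \Psi^{(j)}\bigl(B(0, R)\bigr) \subset B_{\rm abs}, \qquad \forall j \ge J. \]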


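Remark 1.21 The definition of absorbing set is readily generalized to continuous time
dynamical systems; this is left as an exercise for the reader. ♠

Example 1.22 Consider an ODE for which there exist $\alpha, \beta > 0$ such that

\[ \langle f(v), v\rangle \le \alpha - \beta|v|^2, \qquad \forall v \in \mathbb{R}^\ell. \]

Then

\[ \frac12\frac{d}{dt}|v|^2 = \Bigl\langle v, \frac{dv}{dt}\Bigr\rangle = \langle v, f(v)\rangle \le \alpha - \beta|v|^2. \]

Applying the Gronwall Lemma 1.19 gives

\[ |v(t)|^2 \le e^{-2\beta t}|v(0)|^2 + \frac{\alpha}{\beta}\bigl(1 - e^{-2\beta t}\bigr). \]

Hence, if $|v(0)| \le R$ then

\[ |v(t)|^2 \le \frac{2\alpha}{\beta} \qquad \forall t \ge T, \ \text{where } T \text{ is chosen so that } e^{-2\beta T}R^2 \le \frac{\alpha}{\beta}. \]

Therefore the set $B_{\rm abs} = B\bigl(0, \sqrt{2\alpha/\beta}\bigr)$ is absorbing for the ODE (with the generalization of
the above definition of absorbing set to continuous time, as in Remark 1.21).

If $v_j = v(jh)$ so that $\Psi(\cdot) = \Psi(\cdot\,; h)$ and $v_{j+1} = \Psi(v_j)$ then

\[ |v_j|^2 \le \frac{2\alpha}{\beta} \qquad \forall j \ge \frac{T}{h}, \]

where $T$ is as for the ODE case. Hence $B_{\rm abs} = B\bigl(0, \sqrt{2\alpha/\beta}\bigr)$ is also an absorbing set for the
iterated map associated with the ODE.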


Definition 1.23 When the discrete time dynamical system has a bounded absorbing set $B_{\rm abs}$
we define the global attractor $\mathcal{A}$ to be

\[ \mathcal{A} = \bigcap_{k \ge 0} \overline{\bigcup_{j \ge k} \Psi^{(j)}(B_{\rm abs})}. \]

This object captures all the long-time dynamics of the dynamical system. As for the absorbing
set itself this definition is readily generalized to continuous time.

1.2.4. Controlled Dynamical Systems 

It is frequently of interest to add a controller $w = \{w_j\}_{j=0}^{\infty}$ to the discrete time dynamical
system to obtain

\[ v_{j+1} = \Psi(v_j) + w_j. \]

The aim of the controller is to “steer” the dynamical system to achieve some objective.
Interesting examples include:

• given point $v^* \in \mathbb{R}^\ell$ and time $J \in \mathbb{Z}^+$, choose $w$ so that $v_J = v^*$;

• given open set $B$ and time $J \in \mathbb{Z}^+$, choose $w$ so that $v_j \in B$ for all $j \ge J$;

• given $y = \{y_j\}_{j \in \mathbb{N}}$, where $y_j \in \mathbb{R}^m$, and given a function $h : \mathbb{R}^\ell \to \mathbb{R}^m$, choose $w$ to keep
$|y_j - h(v_j)|$ small in some sense.

The third option is most relevant in the context of data assimilation, and so we focus on
it. In this context we will consider controllers of the form $w_j = K(y_j - h(v_j))$ so that

\[ v_{j+1} = \Psi(v_j) + K\bigl(y_j - h(v_j)\bigr). \tag{1.11} \]

A key question is then how to choose $K$ to ensure the desired property. We present a simple
example which illustrates this.

Example 1.24 Let $\ell = m = 1$, $\Psi(v) = \lambda v$ and $h(v) = v$. We assume that the data $\{y_j\}_{j \in \mathbb{N}}$ is
given by $y_{j+1} = v^\dagger_{j+1}$ where $v^\dagger_{j+1} = \lambda v^\dagger_j$. Thus the data is itself generated by the uncontrolled
dynamical system. We wish to use the controller to ensure that the solution of the controlled
system is close to the data $\{y_j\}_{j \in \mathbb{N}}$ generated by the uncontrolled dynamical system, and hence
to the solution of the uncontrolled dynamical system itself.

Consider the controlled dynamical system

\[
\begin{aligned}
v_{j+1} &= \Psi(v_j) + K\bigl(y_j - h(v_j)\bigr) \\
&= \lambda v_j + \underbrace{K(y_j - v_j)}_{w_j}, \qquad j \ge 1,
\end{aligned}
\]

and assume that $v_0 \ne v^\dagger_0$. We are interested in whether $v_j$ approaches $v^\dagger_j$ as $j \to \infty$.
To this end suppose that $K$ is chosen so that $|\lambda - K| < 1$. Then note that

\[ v^\dagger_{j+1} = \lambda v^\dagger_j + \underbrace{K\bigl(y_j - v^\dagger_j\bigr)}_{=0}. \]

Hence $e_j = v_j - v^\dagger_j$ satisfies

\[ e_{j+1} = (\lambda - K)e_j \]

and

\[ |e_{j+1}| = |\lambda - K|\,|e_j|. \]

Since we have chosen $K$ so that $|\lambda - K| < 1$ then we have $|e_j| \to 0$ as $j \to \infty$. Thus the
controlled dynamical system approaches the solution of the uncontrolled dynamical system as
$j \to \infty$. This is prototypical of certain data assimilation algorithms that we will study in
a later chapter. ♠
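The error contraction in Example 1.24 is easy to reproduce. The sketch below assumes $\lambda = 1.5$, so that the uncontrolled dynamics is unstable, and $K = 1$, so that $|\lambda - K| = 0.5$. These values, the initial conditions and the function name `run_controlled_system` are illustrative assumptions.

```python
def run_controlled_system(lam, K, v0, v0_true, steps):
    """Simulate the controlled and uncontrolled systems; return the errors |v_j - v_j^dagger|."""
    v, v_true = v0, v0_true
    errors = [abs(v - v_true)]
    for _ in range(steps):
        y = v_true                       # data y_j = v_j^dagger, since h is the identity
        v = lam * v + K * (y - v)        # controlled system
        v_true = lam * v_true            # uncontrolled system generating the data
        errors.append(abs(v - v_true))
    return errors


lam, K = 1.5, 1.0
errors = run_controlled_system(lam, K, v0=0.0, v0_true=1.0, steps=20)
contraction = abs(lam - K)
print(errors[0], errors[-1])
```

Both trajectories grow without bound because $\lambda > 1$, yet their difference shrinks by the factor $|\lambda - K|$ at each step: what matters is the stability of the error dynamics, not of the underlying map.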

It is also of interest to consider continuous time controllers $\{w(t)\}_{t \ge 0}$ for differential
equations

\[ \frac{dv}{dt} = f(v) + w. \]

Again, the goal is to choose $w$ to achieve some objective analogous to those described in
discrete time.


1.3 Probability Metrics 

Since we will frame data assimilation in terms of probability, natural measures of robustness of
the problem will require the idea of distance between probability measures. Here we introduce
basic metric properties, and then some specific distances on probability measures, and their
properties.

1.3.1. Metric Properties

Definition 1.25 A metric on a set $X$ is a function $d : X \times X \to \mathbb{R}^+$ (distance) satisfying
the following properties:

• coincidence: $d(x,y) = 0$ iff $x = y$;

• symmetry: $d(x,y) = d(y,x)$;

• triangle: $d(x,z) \le d(x,y) + d(y,z)$.

Example 1.26 Let $X = \mathbb{R}^\ell$ viewed as a normed vector space with norm $\|\cdot\|$; for example
we might take $\|\cdot\| = |\cdot|$, the Euclidean norm. Then the function $d : \mathbb{R}^\ell \times \mathbb{R}^\ell \to \mathbb{R}^+$ given by
$d(x,y) := \|x - y\|$ defines a metric. Indeed

• $\|x - y\| = 0$ iff $x = y$.

• $\|x - y\| = \|y - x\|$.

• $\|x - z\| = \|x - y + y - z\| \le \|x - y\| + \|y - z\|$.

from properties of norms. ♠


1.3.2. Metrics on Spaces of Probability Measures 

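Let $\mathcal{M}$ denote the space of probability measures on $\mathbb{R}^\ell$ with strictly positive Lebesgue density
on $\mathbb{R}^\ell$. Throughout this section we let $\mu$ and $\mu'$ be two probability measures in $\mathcal{M}$, and let
$\rho$ and $\rho'$ denote the corresponding densities; recall that we assume that these densities are
positive everywhere, in order to simplify the presentation. We define two useful metrics on
probability measures.


Definition 1.27 The total variation distance on $\mathcal{M}$ is defined by

\[
\begin{aligned}
d_{\rm TV}(\mu, \mu') &= \frac12\int_{\mathbb{R}^\ell} |\rho(u) - \rho'(u)|\,du \\
&= \frac12\,\mathbb{E}^{\mu}\Bigl|1 - \frac{\rho'(u)}{\rho(u)}\Bigr|.
\end{aligned}
\]

Thus the total variation distance is half of the $L^1$ norm of the difference of the two pdfs.
Note that clearly $d_{\rm TV}(\mu, \mu') \ge 0$. Also

\[
\begin{aligned}
d_{\rm TV}(\mu, \mu') &\le \frac12\int_{\mathbb{R}^\ell} |\rho(u)|\,du + \frac12\int_{\mathbb{R}^\ell} |\rho'(u)|\,du \\
&= \frac12\int_{\mathbb{R}^\ell} \rho(u)\,du + \frac12\int_{\mathbb{R}^\ell} \rho'(u)\,du \\
&= 1.
\end{aligned}
\]

Note also that $d_{\rm TV}$ may be characterized as

\[ d_{\rm TV}(\mu, \mu') = \frac12\sup_{|f|_\infty \le 1} \bigl|\mathbb{E}^{\mu}(f) - \mathbb{E}^{\mu'}(f)\bigr| = \frac12\sup_{|f|_\infty \le 1} \bigl|\mu(f) - \mu'(f)\bigr| \tag{1.12} \]

where we have used the convention that $\mu(f) = \mathbb{E}^{\mu}(f) = \int_{\mathbb{R}^\ell} f(v)\,\mu(dv)$ and $|f|_\infty = \sup_u |f(u)|$.

Definition 1.28 The Hellinger distance on $\mathcal{M}$ is defined by

\[
\begin{aligned}
d_{\rm Hell}(\mu, \mu') &= \Bigl(\frac12\int_{\mathbb{R}^\ell} \bigl(\sqrt{\rho(u)} - \sqrt{\rho'(u)}\bigr)^2\,du\Bigr)^{1/2} \\
&= \Bigl(\frac12\,\mathbb{E}^{\mu}\Bigl(1 - \sqrt{\frac{\rho'(u)}{\rho(u)}}\Bigr)^2\Bigr)^{1/2}.
\end{aligned}
\]

Thus the Hellinger distance is a multiple of the $L^2$ distance between the square-roots of
the two pdfs. Again clearly $d_{\rm Hell}(\mu, \mu') \ge 0$. Also

\[ d_{\rm Hell}(\mu, \mu')^2 \le \frac12\int_{\mathbb{R}^\ell} \bigl(\rho(u) + \rho'(u)\bigr)\,du = 1. \]

We also note that the Hellinger and TV distances can be written in a symmetric way and
satisfy the triangle inequality - they are indeed valid distance metrics on the space of probability
measures.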


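Lemma 1.29 The total variation and Hellinger distances satisfy

\[ 0 \le \frac{1}{\sqrt2}\,d_{\rm TV}(\mu, \mu') \le d_{\rm Hell}(\mu, \mu') \le d_{\rm TV}(\mu, \mu')^{1/2} \le 1. \]


Proof The lower and upper bounds of, respectively, 0 and 1 are proved above. We show first
that $\frac{1}{\sqrt2}d_{\rm TV}(\mu, \mu') \le d_{\rm Hell}(\mu, \mu')$. Indeed, by the Cauchy-Schwarz inequality,

\[
\begin{aligned}
d_{\rm TV}(\mu, \mu') &= \frac12\int_{\mathbb{R}^\ell} \Bigl|1 - \sqrt{\frac{\rho'(u)}{\rho(u)}}\Bigr|\,\Bigl|1 + \sqrt{\frac{\rho'(u)}{\rho(u)}}\Bigr|\,\rho(u)\,du \\
&\le \Bigl(\frac12\int_{\mathbb{R}^\ell} \Bigl|1 - \sqrt{\frac{\rho'(u)}{\rho(u)}}\Bigr|^2\rho(u)\,du\Bigr)^{1/2}
     \Bigl(\frac12\int_{\mathbb{R}^\ell} \Bigl|1 + \sqrt{\frac{\rho'(u)}{\rho(u)}}\Bigr|^2\rho(u)\,du\Bigr)^{1/2} \\
&\le d_{\rm Hell}(\mu, \mu')\,\Bigl(\frac12\int_{\mathbb{R}^\ell} 2\Bigl(1 + \frac{\rho'(u)}{\rho(u)}\Bigr)\rho(u)\,du\Bigr)^{1/2} \\
&= \sqrt2\,d_{\rm Hell}(\mu, \mu').
\end{aligned}
\]

Finally, for the inequality $d_{\rm Hell}(\mu, \mu') \le d_{\rm TV}(\mu, \mu')^{1/2}$ note that

\[ |\sqrt{a} - \sqrt{b}| \le \sqrt{a} + \sqrt{b} \qquad \forall a, b > 0. \]

Therefore,

\[
\begin{aligned}
d_{\rm Hell}(\mu, \mu')^2 &= \frac12\int_{\mathbb{R}^\ell} \Bigl|1 - \sqrt{\frac{\rho'(u)}{\rho(u)}}\Bigr|^2\rho(u)\,du \\
&\le \frac12\int_{\mathbb{R}^\ell} \Bigl|1 - \sqrt{\frac{\rho'(u)}{\rho(u)}}\Bigr|\,\Bigl|1 + \sqrt{\frac{\rho'(u)}{\rho(u)}}\Bigr|\,\rho(u)\,du \\
&= \frac12\int_{\mathbb{R}^\ell} \Bigl|1 - \frac{\rho'(u)}{\rho(u)}\Bigr|\,\rho(u)\,du \\
&= d_{\rm TV}(\mu, \mu').
\end{aligned}
\]

□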
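The chain of inequalities in Lemma 1.29 can be observed numerically for a concrete pair of measures. The sketch below assumes two scalar Gaussians, $N(0,1)$ and $N(1,1)$, and approximates both integrals by a Riemann sum on a truncated uniform grid; the grid, the parameters and the helper names are assumptions made for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def tv_and_hellinger(rho, rho_prime, grid):
    """Riemann-sum approximations of d_TV and d_Hell for two densities on a uniform grid."""
    h = grid[1] - grid[0]
    tv = 0.5 * h * sum(abs(rho(u) - rho_prime(u)) for u in grid)
    hell_sq = 0.5 * h * sum((math.sqrt(rho(u)) - math.sqrt(rho_prime(u))) ** 2 for u in grid)
    return tv, math.sqrt(hell_sq)


grid = [-12.0 + 0.005 * k for k in range(4801)]   # covers [-12, 12]
d_tv, d_hell = tv_and_hellinger(
    lambda u: gaussian_pdf(u, 0.0, 1.0),
    lambda u: gaussian_pdf(u, 1.0, 1.0),
    grid,
)
print(d_tv / math.sqrt(2.0), d_hell, math.sqrt(d_tv))
```

The three printed numbers should appear in increasing order, matching the middle part of the lemma. Truncating the domain to $[-12, 12]$ discards only a negligible amount of mass for these two densities.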

Why do we bother to introduce the Hellinger distance, rather than working with the more 
familiar total variation? The answer stems from the following two lemmas. 

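Lemma 1.30 Let $f : \mathbb{R}^\ell \to \mathbb{R}^p$ be such that

\[ \bigl(\mathbb{E}^{\mu}|f(u)|^2 + \mathbb{E}^{\mu'}|f(u)|^2\bigr) < \infty. \]

Then

\[ \bigl|\mathbb{E}^{\mu}f(u) - \mathbb{E}^{\mu'}f(u)\bigr| \le 2\bigl(\mathbb{E}^{\mu}|f(u)|^2 + \mathbb{E}^{\mu'}|f(u)|^2\bigr)^{1/2} d_{\rm Hell}(\mu, \mu'). \tag{1.13} \]

As a consequence

\[ \bigl|\mathbb{E}^{\mu}f(u) - \mathbb{E}^{\mu'}f(u)\bigr| \le 2\bigl(\mathbb{E}^{\mu}|f(u)|^2 + \mathbb{E}^{\mu'}|f(u)|^2\bigr)^{1/2} d_{\rm TV}(\mu, \mu')^{1/2}. \tag{1.14} \]

Proof In the following all integrals are over $\mathbb{R}^\ell$. Now

\[
\begin{aligned}
\bigl|\mathbb{E}^{\mu}f(u) - \mathbb{E}^{\mu'}f(u)\bigr| &\le \int |f(u)|\,|\rho(u) - \rho'(u)|\,du \\
&= \int \sqrt2\,|f(u)|\,\bigl|\sqrt{\rho(u)} + \sqrt{\rho'(u)}\bigr| \cdot \frac{1}{\sqrt2}\bigl|\sqrt{\rho(u)} - \sqrt{\rho'(u)}\bigr|\,du \\
&\le \Bigl(\int 2|f(u)|^2\bigl|\sqrt{\rho(u)} + \sqrt{\rho'(u)}\bigr|^2\,du\Bigr)^{1/2}
     \Bigl(\frac12\int \bigl|\sqrt{\rho(u)} - \sqrt{\rho'(u)}\bigr|^2\,du\Bigr)^{1/2} \\
&\le \Bigl(\int 4|f(u)|^2\bigl(\rho(u) + \rho'(u)\bigr)\,du\Bigr)^{1/2}
     \Bigl(\frac12\int \Bigl(1 - \sqrt{\frac{\rho'(u)}{\rho(u)}}\Bigr)^2\rho(u)\,du\Bigr)^{1/2} \\
&= 2\bigl(\mathbb{E}^{\mu}|f(u)|^2 + \mathbb{E}^{\mu'}|f(u)|^2\bigr)^{1/2} d_{\rm Hell}(\mu, \mu').
\end{aligned}
\]

Thus (1.13) follows. The bound (1.14) follows from Lemma 1.29. □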

Remark 1.31 The preceding lemma shows that, if two measures $\mu$ and $\mu'$ are $O(\epsilon)$ close in
the Hellinger metric, and if the function $f(u)$ is square integrable with respect to $u$ distributed
according to $\mu$ and $\mu'$, then expectations of $f(u)$ with respect to $\mu$ and $\mu'$ are also $O(\epsilon)$ close.
It also shows that, under the same assumptions on $f$, if two measures $\mu$ and $\mu'$ are $O(\epsilon)$
close in the total variation metric, then expectations of $f(u)$ with respect to $\mu$ and $\mu'$ are only
$O(\epsilon^{1/2})$ close. This second result is sharp and to get $O(\epsilon)$ closeness of expectations using $O(\epsilon)$
closeness in the TV metric requires a stronger assumption on $f$, as we now show. ♠

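Lemma 1.32 Assume that $|f|$ is bounded almost surely with respect to both $\mu$ and $\mu'$ and denote
the almost sure upper bound on $|f|$ by $f_{\max}$. Then

\[ \bigl|\mathbb{E}^{\mu}f(u) - \mathbb{E}^{\mu'}f(u)\bigr| \le 2f_{\max}\,d_{\rm TV}(\mu, \mu'). \]

Proof Under the given assumption on $f$,

\[
\begin{aligned}
\bigl|\mathbb{E}^{\mu}f(u) - \mathbb{E}^{\mu'}f(u)\bigr| &\le \int |f(u)|\,|\rho(u) - \rho'(u)|\,du \\
&\le 2f_{\max}\Bigl(\frac12\int |\rho(u) - \rho'(u)|\,du\Bigr) \\
&\le 2f_{\max}\Bigl(\frac12\int \Bigl|1 - \frac{\rho'(u)}{\rho(u)}\Bigr|\,\rho(u)\,du\Bigr) \\
&= 2f_{\max}\,d_{\rm TV}(\mu, \mu').
\end{aligned}
\]

□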

The implication of the preceding two lemmas and remark is that it is natural to work with 
the Hellinger metric, rather than the total variation metric, whenever considering the effect 
of perturbations of the measure on expectations of functions which are square integrable, but 
not bounded. 


1.4 Probabilistic View of Dynamical Systems 

Here we look at the natural connection between dynamical systems, and the underlying
dynamical system that they generate on probability measures. The key idea here is that the
Markovian propagation of probability measures is linear, even when the underlying dynamical
system is nonlinear. This advantage of linearity is partially offset by the fact that the
underlying dynamics on probability distributions is infinite dimensional, but it is nonetheless
a powerful perspective on dynamical systems. Example 1.15 provides a nice introductory
example demonstrating the probability distributions carried by a stochastic dynamical system;
in that case the probability distributions are Gaussian and we explicitly characterize their
evolution through the mean and covariance. The idea of mapping probability measures under
the dynamical system can be generalized, but is typically more complicated because the
probability distributions are typically not Gaussian and not characterized by a finite number
of parameters.

1.4.1. Markov Kernel 

Definition 1.33 $p : \mathbb{R}^\ell \times \mathcal{B}(\mathbb{R}^\ell) \to \mathbb{R}^+$ is a Markov kernel if:

• for each $x \in \mathbb{R}^\ell$, $p(x, \cdot)$ is a probability measure on $(\mathbb{R}^\ell, \mathcal{B}(\mathbb{R}^\ell))$;

• $x \mapsto p(x, A)$ is $\mathcal{B}(\mathbb{R}^\ell)$-measurable for all $A \in \mathcal{B}(\mathbb{R}^\ell)$.

The first condition is the key one for the material in this book: the Markov kernel at fixed $x$
describes the probability distribution of a new point $y \sim p(x, \cdot)$. By iterating on this we may
generate a sequence of points which constitute a sample from the distribution of the Markov
chain, as described below, defined by the Markov kernel. The second measurability condition
ensures an appropriate mathematical setting for the problem, but an in-depth understanding
of this condition is not essential for the reader of this book. In the same way that we use $\mathbb{P}$
to denote both the probability measure and its pdf, we sometimes use $p(x, \cdot) : \mathbb{R}^\ell \to \mathbb{R}^+$, for
each fixed $x \in \mathbb{R}^\ell$, to denote the corresponding pdf of the Markov kernel from the preceding
definition.

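Consider the stochastic dynamical system

\[ v_{j+1} = \Psi(v_j) + \xi_j, \]

where $\xi = \{\xi_j\}_{j \in \mathbb{Z}^+}$ is an i.i.d. sequence distributed according to a probability measure on $\mathbb{R}^\ell$
with density $\rho(\cdot)$. We assume that the initial condition $v_0$ is possibly random, but independent
of $\xi$. Under these assumptions on the probabilistic structure, we say that $\{v_j\}_{j \in \mathbb{Z}^+}$ is a Markov
chain. For this Markov chain we have

\[ \mathbb{P}(v_{j+1} | v_j) = \rho\bigl(v_{j+1} - \Psi(v_j)\bigr); \]

thus

\[ \mathbb{P}(v_{j+1} \in A | v_j) = \int_A \rho\bigl(v - \Psi(v_j)\bigr)\,dv. \]

In fact we can define a Markov kernel

\[ p(u, A) = \int_A \rho\bigl(v - \Psi(u)\bigr)\,dv, \]

with the associated pdf

\[ p(u, v) = \rho\bigl(v - \Psi(u)\bigr). \]

If $v_j \sim \mu_j$ with pdf $\rho_j$ then

\[
\begin{aligned}
\mu_{j+1}(A) = \mathbb{P}(v_{j+1} \in A) &= \int_{\mathbb{R}^\ell} \mathbb{P}(v_{j+1} \in A | v_j)\,\mathbb{P}(v_j)\,dv_j \\
&= \int_{\mathbb{R}^\ell} p(u, A)\,\rho_j(u)\,du.
\end{aligned}
\]

And then

\[
\begin{aligned}
\rho_{j+1}(v) &= \int_{\mathbb{R}^\ell} p(u, v)\,\rho_j(u)\,du \\
&= \int_{\mathbb{R}^\ell} \rho\bigl(v - \Psi(u)\bigr)\,\rho_j(u)\,du.
\end{aligned}
\]

Furthermore we have a linear dynamical system for the evolution of the pdf

\[ \rho_{j+1} = P\rho_j, \tag{1.15} \]

where $P$ is the integral operator

\[ (P\pi)(v) = \int_{\mathbb{R}^\ell} \rho\bigl(v - \Psi(u)\bigr)\,\pi(u)\,du. \]

Example 1.34 Let $\Psi : \mathbb{R}^\ell \to \mathbb{R}^\ell$. Assume that $\xi_1 \sim N(0, \sigma^2 I)$. Then

\[ \rho_{j+1}(v) = \int_{\mathbb{R}^\ell} \frac{1}{(2\pi)^{\ell/2}\sigma^\ell} \exp\Bigl(-\frac{1}{2\sigma^2}|v - \Psi(u)|^2\Bigr)\,\rho_j(u)\,du. \]

As $\sigma \to 0$ we obtain the deterministic model

\[ \rho_{j+1}(v) = \int_{\mathbb{R}^\ell} \delta\bigl(v - \Psi(u)\bigr)\,\rho_j(u)\,du. \]

For each integer $n \in \mathbb{N}$, we use the notation $p^n(u, \cdot)$ to denote the Markov kernel arising
from $n$ steps of the Markov chain; thus $p^1(u, \cdot) = p(u, \cdot)$. Furthermore, $p^n(u, A) = \mathbb{P}(v_n \in A \,|\, v_0 = u)$.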
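The evolution of a pdf under the integral operator $P$ can be approximated on a grid. The sketch below assumes the scalar linear map $\Psi(u) = \lambda u$ with Gaussian noise, as in Example 1.15, replaces the integral by a Riemann sum on a truncated uniform grid, and compares the mean and variance after five steps with the closed-form values from that example. The grid, the parameters and the function names are all illustrative assumptions.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def apply_transfer_operator(rho_j, grid, psi, noise_pdf):
    """One step rho_{j+1}(v) = int noise_pdf(v - psi(u)) rho_j(u) du, as a Riemann sum."""
    h = grid[1] - grid[0]
    images = [psi(u) for u in grid]
    return [h * sum(noise_pdf(v - pu) * r for pu, r in zip(images, rho_j)) for v in grid]


def mean_and_variance(rho, grid):
    """Mean and variance of a density tabulated on a uniform grid."""
    h = grid[1] - grid[0]
    mass = h * sum(rho)
    mean = h * sum(v * r for v, r in zip(grid, rho)) / mass
    var = h * sum((v - mean) ** 2 * r for v, r in zip(grid, rho)) / mass
    return mean, var


lam, sigma = 0.5, 1.0
grid = [-8.0 + 0.1 * k for k in range(161)]
rho = [gaussian_pdf(u, 2.0, 0.5) for u in grid]       # rho_0 is the density of N(2, 0.25)
for _ in range(5):
    rho = apply_transfer_operator(
        rho, grid, lambda u: lam * u, lambda x: gaussian_pdf(x, 0.0, sigma)
    )
grid_mean, grid_var = mean_and_variance(rho, grid)
print(grid_mean, grid_var)
```

Each step is a matrix-vector product in the discretized setting, which reflects the linearity of the pdf dynamics even when $\Psi$ itself is nonlinear. The cost grows quickly with the dimension $\ell$, so this direct approach is realistic only in very low dimensions.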
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1.4.2. Ergodicity 

In many situations we will appeal to ergodic theorems to extract information from sequences 
generated by a (possibly stochastic) dynamical systems. Assume that this dynamical 
systems is invariant with respect to probability measure ^ao- Then, roughly speaking, an 
ergodic dynamical systems is one for which, for a suitable class of test functions t/j : —>■ K., 

and vq almost surely with respect to the invariant measure giao, the Markov chain from the 
previous subsection satisfies 

‘P{v)n^{dv) =E^°°ip{v). (1.16) 

We say that the time average equals the space average. The preceding identity encodes the idea that the histogram formed by a single trajectory $\{v_j\}$ of the Markov chain looks more and more like the pdf of the underlying invariant measure. Since the convergence is almost sure with respect to the initial condition, this implies that the statistics of where the trajectory spends time are, asymptotically, independent of the initial condition; this is a very powerful property.

If the Markov chain has a unique invariant density $\rho_\infty$, which is a fixed point of the linear dynamical system (1.15), then it will satisfy

$$\rho_\infty = P \rho_\infty, \tag{1.17}$$

or equivalently

$$\rho_\infty(v) = \int_{\mathbb{R}^\ell} p(u, v) \rho_\infty(u) \, du. \tag{1.18}$$

In the ergodic setting, this equation will have a form of uniqueness within the class of pdfs and, furthermore, it is often possible to prove, in some norm, the convergence

$$\rho_j \to \rho_\infty \quad \text{as } j \to \infty.$$

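Example 1.35 Example 1.15 generates an ergodic Markov chain $\{v_j\}_{j \in \mathbb{Z}^+}$ carrying the sequence of pdfs $\rho_j$. Furthermore, each $\rho_j$ is the density of a Gaussian $N(m_j, \sigma_j^2)$. If $|\lambda| < 1$ then $m_j \to 0$ and $\sigma_j^2 \to \sigma_\infty^2$ where

$$\sigma_\infty^2 = \frac{\sigma^2}{1 - \lambda^2}.$$

Thus $\rho_\infty$ is the density of a Gaussian $N(0, \sigma_\infty^2)$. We then have

$$\frac{1}{J} \sum_{j=1}^{J} \varphi(v_j) = \frac{1}{J} \sum_{j=1}^{J} \varphi\Bigl(\lambda^j v_0 + \sum_{i=0}^{j-1} \lambda^{j-i-1} \xi_i\Bigr) \to \int_{\mathbb{R}} \rho_\infty(v) \varphi(v) \, dv.$$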
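The statement that the time average equals the space average can be checked by simulation. The sketch below assumes the scalar linear chain $v_{j+1} = \lambda v_j + \xi_j$ with $\xi_j \sim N(0, \sigma^2)$ and the test function $\varphi(v) = v^2$, whose space average under $N(0, \sigma_\infty^2)$ is $\sigma_\infty^2 = \sigma^2 / (1 - \lambda^2)$; the helper name `time_average`, the seeds, the trajectory length and the two initial conditions are illustrative assumptions.

```python
import random

def time_average(phi, lam, sigma, v0, J, seed):
    """Return (1/J) * sum_{j=1}^{J} phi(v_j) for v_{j+1} = lam * v_j + xi_j."""
    rng = random.Random(seed)          # fixed seed so the run is reproducible
    v = v0
    total = 0.0
    for _ in range(J):
        v = lam * v + rng.gauss(0.0, sigma)
        total += phi(v)
    return total / J

lam, sigma, J = 0.5, 1.0, 200_000      # illustrative parameter values
space_average = sigma ** 2 / (1.0 - lam ** 2)

# Two different initial conditions and independent noise realizations.
avg_a = time_average(lambda v: v * v, lam, sigma, v0=0.0, J=J, seed=1)
avg_b = time_average(lambda v: v * v, lam, sigma, v0=10.0, J=J, seed=2)
print(space_average, avg_a, avg_b)
```

Both time averages should lie close to the same limit, reflecting the fact that the long-time statistics do not depend on the starting point.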




1.4.3. Bayes’ Formula as a Map 

Recall that Bayes’ formula states that

$$\mathbb{P}(a | b) = \frac{1}{\mathbb{P}(b)} \mathbb{P}(b | a) \mathbb{P}(a).$$

This may be viewed as a map from $\mathbb{P}(a)$ (what we know about $a$ a priori, the prior) to $\mathbb{P}(a | b)$ (what we know about $a$ once we have observed the variable $b$, the posterior). Since

$$\mathbb{P}(b) = \int_{\mathbb{R}^\ell} \mathbb{P}(b | a) \mathbb{P}(a) \, da$$

we see that

$$\mathbb{P}(a | b) = \frac{\mathbb{P}(b | a) \mathbb{P}(a)}{\int_{\mathbb{R}^\ell} \mathbb{P}(b | a) \mathbb{P}(a) \, da} =: L \mathbb{P}(a).$$

$L$ is a nonlinear map which takes the pdf $\mathbb{P}(a)$ into $\mathbb{P}(a | b)$. We use the letter $L$ to highlight the fact that the map is defined, in the context of Bayesian statistics, by using the likelihood to map prior to posterior.
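A discretized version of the map $L$ makes its two ingredients explicit: pointwise multiplication by the likelihood, followed by normalization. The sketch below assumes a scalar unknown $a$ with prior $N(0, \sigma^2)$ and likelihood $b | a \sim N(a, \gamma^2)$, for which the posterior is the Gaussian with mean $\sigma^2 b / (\sigma^2 + \gamma^2)$ and variance $\sigma^2 \gamma^2 / (\sigma^2 + \gamma^2)$ (standard Gaussian conditioning); the function name `bayes_map`, the grid and the numerical values are illustrative.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def bayes_map(prior_vals, likelihood_vals, dx):
    """Discrete analogue of L: multiply by the likelihood, then normalize."""
    unnormalized = [l * p for l, p in zip(likelihood_vals, prior_vals)]
    evidence = sum(unnormalized) * dx      # approximates P(b)
    return [u / evidence for u in unnormalized]

sigma2, gamma2, b = 1.0, 0.25, 1.0         # illustrative values
dx = 0.01
grid = [-8.0 + dx * i for i in range(1601)]

prior = [gaussian_pdf(a, 0.0, sigma2) for a in grid]
likelihood = [gaussian_pdf(b, a, gamma2) for a in grid]   # as a function of a
posterior = bayes_map(prior, likelihood, dx)

post_mean = sum(a * p for a, p in zip(grid, posterior)) * dx
post_var = sum((a - post_mean) ** 2 * p for a, p in zip(grid, posterior)) * dx
print(post_mean, post_var)
```

The division by the evidence is what makes $L$ nonlinear: scaling the prior by a constant leaves the output unchanged rather than scaling it.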


1.5 Bibliography 

• For background material on probability, as covered in section 1.1, the reader is directed to the elementary textbook [20], and to the more advanced texts [50], [120] for further material (for example the definition of measurable). The book [S^j, together with the references therein, provides an excellent introduction to Markov chains. The book m is a comprehensive study of ergodicity for Markov chains; the central use of Lyapunov functions will make it particularly accessible for readers with a background in dynamical systems. Note also that Theorem 3.3 contains a basic ergodic result for Markov chains.

• Section 1.2 concerns dynamical systems and stochastic dynamical systems. The deterministic setting is overviewed in numerous textbooks, such as mm. with more advanced material, related to infinite dimensional problems, covered in [109]. The ergodicity of stochastic dynamical systems is overviewed in [7] and targeted treatments based on the small noise scenario include min]. The book [107] contains elementary chapters on dynamical systems, and the book chapter [59] contains related material in the context of stochastic dynamical systems. For the subject of control theory the reader is directed to m. which has a particularly good exposition of the linear theory, and [104] for the nonlinear setting.

• Probability metrics are the subject of section 1.3 and the survey paper [?7| provides a very readable introduction to this subject, together with references to the wider literature.

• Viewing (stochastic) dynamical systems as generating a dynamical system on the probability measure which they carry is an enormously powerful way of thinking. The reader is directed to the books [117] and |S] for overviews of this subject, and further references.

1.6 Exercises 

1. Consider the ODE

$$\frac{dv}{dt} = v - v^3, \quad v(0) = v_0.$$

By finding the exact solution, determine the one-parameter semigroup $\Psi(\cdot; t)$ with properties (1.10).
2. Consider a jointly varying random variable $(a, b) \in \mathbb{R}^2$ defined as follows: $a \sim N(0, \sigma^2)$ and $b | a \sim N(a, \gamma^2)$. Find a formula for the probability density function of $(a, b)$, using
([Lfibb and demonstrate that the random variable is a Gaussian with mean and covari¬ 
ance which you should specify. Using (II3, find a formula for the probability density 
function of a\b; again demonstrate that the random variable is a Gaussian with mean 
and covariance which you should specify. 

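3. Consider two Gaussian densities on $\mathbb{R}$: $N(m_1, \sigma_1^2)$ and $N(m_2, \sigma_2^2)$. Show that the Hellinger distance between them is given by

$$d_{\mathrm{Hell}}(\mu_1, \mu_2)^2 = 1 - \sqrt{\exp\Bigl(-\frac{(m_1 - m_2)^2}{2(\sigma_1^2 + \sigma_2^2)}\Bigr) \frac{2 \sigma_1 \sigma_2}{\sigma_1^2 + \sigma_2^2}}.$$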


4. Consider two Gaussian measures on $\mathbb{R}$: $N(m_1, \sigma_1^2)$ and $N(m_2, \sigma_2^2)$. Show that the total variation distance between the measures tends to zero if $m_2 \to m_1$ and $\sigma_2^2 \to \sigma_1^2$.

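5. The Kullback-Leibler divergence between two measures $\mu'$ and $\mu$, with pdfs $\rho'$ and $\rho$ respectively, is

$$D_{\mathrm{KL}}(\mu' \| \mu) = \int \log\Bigl(\frac{\rho'(x)}{\rho(x)}\Bigr) \rho'(x) \, dx.$$

Does $D_{\mathrm{KL}}$ define a metric on probability measures? Justify your answer. Consider two Gaussian densities on $\mathbb{R}$: $N(m_1, \sigma_1^2)$ and $N(m_2, \sigma_2^2)$. Show that the Kullback-Leibler divergence between them is given by

$$D_{\mathrm{KL}}(\mu_1 \| \mu_2) = \ln\Bigl(\frac{\sigma_2}{\sigma_1}\Bigr) + \frac{1}{2}\Bigl(\frac{\sigma_1^2}{\sigma_2^2} - 1\Bigr) + \frac{(m_2 - m_1)^2}{2 \sigma_2^2}.$$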


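6. Assume that two measures $\mu$ and $\mu'$ have positive Lebesgue densities $\rho$ and $\rho'$ respectively. Prove the bounds:

$$d_{\mathrm{Hell}}(\mu, \mu')^2 \le \frac{1}{2} D_{\mathrm{KL}}(\mu \| \mu'), \quad d_{\mathrm{TV}}(\mu, \mu')^2 \le D_{\mathrm{KL}}(\mu \| \mu'),$$

where the Kullback-Leibler divergence $D_{\mathrm{KL}}$ is defined in the preceding exercise.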

7. Consider the stochastic dynamical system of Example 1.15. Find explicit formulae for the maps $m_j \mapsto m_{j+1}$ and $\sigma_j^2 \mapsto \sigma_{j+1}^2$.

8. Directly compute the mean and covariance of $w = a + Lz$ if $z$ is Gaussian $N(m, C)$, without using the characteristic function. Verify that you obtain the same result as in
Lemma O 

9. Prove Lemma [TT51 

10. Generalize Definitions 1.20 and 1.23 to continuous time, as suggested in Remark 1.21.
Chapter 2


Discrete Time: Formulation 


In this chapter we introduce the mathematical framework for discrete-time data assimilation. Section 2.1 describes the mathematical models we use for the underlying signal, which we wish to recover, and for the data, which we use for the recovery. In section 2.2 we introduce a number of examples used throughout the text to illustrate the theory. Sections 2.3 and 2.4 respectively describe two key problems related to the conditioning of the signal $v$ on the data $y$, namely smoothing and filtering; in section 2.5 we describe how these two key problems are related. Section 2.6 proves that the smoothing problem is well-posed and, using the connection to filtering described in section 2.5, that the filtering problem is well-posed; here well-posedness refers to continuity of the desired conditioned probability distribution with respect to the observed data. Section 2.7 discusses approaches to evaluating the quality of data assimilation algorithms. In section 2.8 we describe various illustrations of the foregoing theory and conclude the chapter with section 2.9, devoted to a bibliographical overview, and section 2.10, containing exercises.


2.1 Set-Up 

We assume throughout the book that $\Psi \in C(\mathbb{R}^n, \mathbb{R}^n)$ and consider the Markov chain $v = \{v_j\}_{j \in \mathbb{Z}^+}$ defined by the random map

$$v_{j+1} = \Psi(v_j) + \xi_j, \quad j \in \mathbb{Z}^+, \tag{2.1a}$$
$$v_0 \sim N(m_0, C_0), \tag{2.1b}$$

where $\xi = \{\xi_j\}_{j \in \mathbb{Z}^+}$ is an i.i.d. sequence, with $\xi_0 \sim N(0, \Sigma)$ and $\Sigma > 0$. Because $(v_0, \xi)$ is a random variable, so too is the solution sequence $\{v_j\}_{j \in \mathbb{Z}^+}$: the signal, which determines the state of the system at each discrete time instance. For simplicity we assume that $v_0$ and $\xi$ are independent. The probability distribution of the random variable $v$ quantifies the uncertainty in predictions arising from this stochastic dynamics model.

In many applications, models such as (2.1) are supplemented by observations of the system as it evolves; this information then changes the probability distribution on the signal, typically reducing the uncertainty. To describe such situations we assume that we are given data, or observations, $y = \{y_j\}_{j \in \mathbb{N}}$ defined as follows. At each discrete time instance we observe a (possibly nonlinear) function of the signal, with additive noise:

$$y_{j+1} = h(v_{j+1}) + \eta_{j+1}, \quad j \in \mathbb{Z}^+, \tag{2.2}$$

where $h \in C(\mathbb{R}^n, \mathbb{R}^m)$ and $\eta = \{\eta_j\}_{j \in \mathbb{N}}$ is an i.i.d. sequence, independent of $(v_0, \xi)$, with $\eta_1 \sim N(0, \Gamma)$ and $\Gamma > 0$. The function $h$ is known as the observation operator. The objective of data assimilation is to determine information about the signal $v$, given data $y$. Mathematically we wish to solve the problem of conditioning the random variable $v$ on the observed data $y$, or problems closely related to this. Note that we have assumed that both the model noise $\xi$ and the observational noise $\eta$ are Gaussian; this assumption is made for convenience only, and could be easily generalized.

We will also be interested in the case where the dynamics is deterministic and (2.1) becomes

$$v_{j+1} = \Psi(v_j), \quad j \in \mathbb{Z}^+, \tag{2.3a}$$
$$v_0 \sim N(m_0, C_0). \tag{2.3b}$$

In this case, which we refer to as deterministic dynamics, we are interested in the random variable $v_0$, given the observed data $y$; note that $v_0$ determines all subsequent values of the signal $v$.

Finally we mention that in many applications the function $\Psi$ is the solution operator for an ordinary differential equation (ODE) of the form¹

$$\frac{dv}{dt} = f(v), \quad t \in (0, \infty), \tag{2.4a}$$
$$v(0) = v_0. \tag{2.4b}$$

Then, assuming the solution exists for all $t \ge 0$, there is a one-parameter semi-group of operators $\Psi(\cdot; t)$, parametrized by time $t \ge 0$, with properties defined in (1.10). In this situation we assume that $\Psi(u) = \Psi(u; \tau)$, i.e. the solution operator over $\tau$ time units, where $\tau$ is the time between observations; thus we implicitly make the simplifying assumption that the observations are made at equally spaced time-points, and note that the state $v_j = v(j\tau)$ evolves according to (2.3a). We use the notation $\Psi^{(j)}(\cdot)$ to denote the $j$-fold composition of $\Psi$ with itself. Thus, in the case of continuous time dynamics, $\Psi(\cdot; j\tau) = \Psi^{(j)}(\cdot)$.


2.2 Guiding Examples 

Throughout these notes we will use the following examples to illustrate the theory and algo¬ 
rithms presented. 

Example 2.1 We consider the case of one dimensional linear dynamics where

$$\Psi(v) = \lambda v \tag{2.5}$$

for some scalar $\lambda \in \mathbb{R}$. Figure 2.1 compares the behaviour of the stochastic dynamics (2.1) and deterministic dynamics (2.3) for the two values $\lambda = 0.5$ and $\lambda = 1.05$. We set $\Sigma = \sigma^2$ and in both cases 50 iterations of the map are shown. We observe that the presence of noise does not significantly alter the dynamics of the system for the case when $|\lambda| > 1$, since for both the stochastic and deterministic models $|v_j| \to \infty$ as $j \to \infty$. The effects of stochasticity are more pronounced when $|\lambda| < 1$, since in this case the deterministic map satisfies $v_j \to 0$ whilst, for the stochastic model, $v_j$ fluctuates randomly around 0.


¹Here the use of $v = \{v(t)\}_{t \ge 0}$ for the solution of this equation should be distinguished from our use of $v = \{v_j\}_{j=0}^{\infty}$ for the solution of (2.1).

Figure 2.1: Behaviour of (2.1) for $\Psi$ given by (2.5) for different values of $\lambda$ and $\Sigma = \sigma^2$. (a) $\lambda = 0.5$ (b) $\lambda = 1.05$


Using (2.1a), together with the linearity of $\Psi$ and the Gaussianity of the noise $\xi_j$, we obtain

$$\mathbb{E}(v_{j+1}) = \lambda \mathbb{E}(v_j), \quad \mathbb{E}(v_{j+1}^2) = \lambda^2 \mathbb{E}(v_j^2) + \sigma^2.$$

If $|\lambda| > 1$ then the second moment explodes as $j \to \infty$, as does the modulus of the first moment if $\mathbb{E}(v_0) \neq 0$. On the other hand, if $|\lambda| < 1$, we see (Example 1.15) that $\mathbb{E}(v_j) \to 0$ and $\mathbb{E}(v_j^2) \to \sigma_\infty^2$ where

$$\sigma_\infty^2 = \frac{\sigma^2}{1 - \lambda^2}. \tag{2.6}$$

Indeed, since $v_0$ is Gaussian, the model (2.1a) with linear $\Psi$ and Gaussian noise $\xi_j$ gives rise to a random variable $v_j$ which is also Gaussian. Thus, from the convergence of the mean and the second moment of $v_j$, we conclude that $v_j$ converges weakly to the random variable $N(0, \sigma_\infty^2)$. This is an example of ergodicity as expressed in (1.16): the invariant measure $\mu_\infty$ is the Gaussian $N(0, \sigma_\infty^2)$ and the density $\rho_\infty$ is the Lebesgue density of this Gaussian. ♠
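The two regimes of the second-moment recursion can be seen by iterating it directly. This is a small sketch of the recursion $\mathbb{E}(v_{j+1}^2) = \lambda^2 \mathbb{E}(v_j^2) + \sigma^2$; the function name, the initial second moment and the number of steps are illustrative assumptions.

```python
def second_moment_sequence(lam, sigma2, m2_0, steps):
    """Iterate E(v_{j+1}^2) = lam^2 * E(v_j^2) + sigma2 starting from m2_0."""
    m2 = m2_0
    out = [m2]
    for _ in range(steps):
        m2 = lam ** 2 * m2 + sigma2
        out.append(m2)
    return out

sigma2 = 1.0

# Contracting case |lambda| < 1: convergence to sigma^2 / (1 - lambda^2).
stable = second_moment_sequence(0.5, sigma2, m2_0=9.0, steps=50)
limit = sigma2 / (1.0 - 0.5 ** 2)

# Expanding case |lambda| > 1: the second moment grows without bound.
unstable = second_moment_sequence(1.05, sigma2, m2_0=9.0, steps=50)

print(stable[-1], limit, unstable[-1])
```

In the contracting case the distance to the limit shrinks by the factor $\lambda^2$ at every step, so convergence is geometric.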


Example 2.2 Now consider the case of two dimensional linear dynamics. In this case

$$\Psi(v) = Av, \tag{2.7}$$

with $A$ a $2 \times 2$ matrix of one of the following three forms $A_\ell$:

$$A_1 = \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}, \quad A_2 = \begin{pmatrix} \lambda & \alpha \\ 0 & \lambda \end{pmatrix}, \quad A_3 = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}.$$

For $\ell = 1, 2$ the behaviour of (2.1) for $\Psi(u) = A_\ell u$ can be understood from the analysis underlying the previous Example 2.1 and the behaviour is similar, in each coordinate, depending on whether the $\lambda$ value on the diagonal is smaller than, or larger than, 1. However, the picture is more interesting when we consider the third choice $\Psi(u) = A_3 u$ as, in this case, the matrix $A_3$ has purely imaginary eigenvalues and corresponds to a rotation by $\pi/2$ in the plane; this is illustrated in Figure 2.2. Addition of noise into the dynamics gives a qualitatively different picture: now the step $j$ to $j + 1$ corresponds to a rotation by $\pi/2$, composed with a random shift of origin; this is illustrated in Figure 2.2.
Example 2.3 We now consider our first nonlinear example, namely the one-dimensional dynamics for which

$$\Psi(v) = \alpha \sin v. \tag{2.8}$$

Figure 2.3 illustrates the behaviour of (2.1) for this choice of $\Psi$, and with $\alpha = 2.5$, both for deterministic and stochastic dynamics. In the case of deterministic dynamics, Figure 2.3a, we see that eventually iterates of the discrete map converge to a period 2 solution. Although only one period 2 solution is seen in this single trajectory, we can deduce that there will be another period 2 solution, related to this one by the symmetry $v \mapsto -v$. This second solution is manifest when we consider stochastic dynamics. Figure 2.3b demonstrates that the inclusion of noise significantly changes the behaviour of the system. The signal now exhibits bistable behaviour and, within each mode of the behavioural dynamics, vestiges of the period 2 dynamics may be seen: the upper mode of the dynamics is related to the period 2 solution shown in Figure 2.3a and the lower mode to the period 2 solution found from applying the symmetry $v \mapsto -v$ to obtain a second period 2 solution from that shown in Figure 2.3a.

Figure 2.3: Behaviour of (2.1) for $\Psi$ given by (2.8) for $\alpha = 2.5$ and $\Sigma = \sigma^2$, see also p1.m in section 5.1.1. (a) Deterministic dynamics (b) Stochastic dynamics, $\sigma = 0.25$
A good way of visualizing ergodicity is via the empirical measure, or histogram, generated by a trajectory of the dynamical system. Equation (1.16) formalizes the idea that the histogram, in the large $J$ limit, converges to the probability density function of a random variable, independently of the starting point $v_0$. Thinking in terms of pdfs of the signal, or functions of the signal, and neglecting time-ordering information, is a very useful viewpoint throughout these notes.

Histograms visualize complex dynamical behaviour such as that seen in Figure 2.3b by ignoring time-correlation in the signal and simply keeping track of where the solution goes as time elapses, but not the order in which places are visited. This is illustrated in Figure 2.4, where we plot the histogram corresponding to the dynamics shown in Figure 2.3b, but
calculated using a simulation of length J = 10^. We observe that the system quickly forgets 
its initial condition and spends an almost equal proportion of time around the positive and 
negative period 2 solutions of the underlying deterministic map. The histogram in Figure 2.4 would change very little if the system were started from a different initial condition, reflecting ergodicity of the underlying map.

Figure 2.4: Probability density functions for $v_j$, $j = 0, \dots, J$, for J = 10^


Example 2.4 We now consider a second one-dimensional and nonlinear map, for which

$$\Psi(v) = r v (1 - v). \tag{2.9}$$

We consider initial data $v_0 \in [0, 1]$ noting that, for $r \in [0, 4]$, the signal will then satisfy $v_j \in [0, 1]$ for all $j$, in the case of the deterministic dynamics (2.3). We confine our discussion here to the deterministic case which can itself exhibit quite rich behaviour. In particular, the behaviour of (2.3), (2.9) can be seen in Figure 2.5 for the values of $r = 2$ and $r = 4$. These values of $r$ have the desirable property that it is possible to determine the signal analytically. For $r = 2$ one obtains

$$v_j = \frac{1}{2} - \frac{1}{2} (1 - 2 v_0)^{2^j}, \tag{2.10}$$

which implies that, for any value of $v_0 \neq 0, 1$, $v_j \to 1/2$, as we can also see in Figure 2.5a. For $v_0 = 0$ the solution remains at the unstable fixed point 0, whilst for $v_0 = 1$ the solution maps onto 0 in one step, and then remains there. In the case $r = 4$ the solution is given by

$$v_j = \sin^2(2^j \pi \theta), \quad \text{with } v_0 = \sin^2(\pi \theta). \tag{2.11}$$
Figure 2.5: Behaviour of (2.1) for $\Psi$ given by (2.9). (a) Deterministic dynamics, $r = 2$ (b) Deterministic dynamics, $r = 4$

This solution can also be expressed in the form

$$v_j = \sin^2(2 \pi z_j), \tag{2.12}$$

where

$$z_{j+1} = \begin{cases} 2 z_j, & 0 \le z_j < \frac{1}{2}, \\ 2 z_j - 1, & \frac{1}{2} \le z_j < 1, \end{cases}$$

and using this formula it is possible to show that this map produces chaotic dynamics for almost all initial conditions. This is illustrated in Figure 2.5b, where we plot the first 100 iterations of the map. In addition, in Figure 2.4, we plot the pdf using a long trajectory of
Vj of length J = 10^, demonstrating the ergodicity of the map. In fact there is an analytic 
formula for the steady state value of the pdf (the invariant density) found as $J \to \infty$; it is given by

$$\rho(x) = \frac{1}{\pi \sqrt{x (1 - x)}}. \tag{2.13}$$
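The closed-form solutions for $r = 2$ and $r = 4$ can be compared against direct iteration of the map. The sketch below uses $v_j = \frac{1}{2} - \frac{1}{2}(1 - 2v_0)^{2^j}$ for $r = 2$ and $v_j = \sin^2(2^j \pi \theta)$ with $v_0 = \sin^2(\pi \theta)$ for $r = 4$, the form that keeps $v_j$ in $[0, 1]$; the function names, the initial data and the number of steps are illustrative assumptions. Only a few steps are compared in the chaotic case because rounding errors are amplified at every iteration.

```python
import math

def logistic_orbit(r, v0, steps):
    """Iterate v_{j+1} = r * v_j * (1 - v_j) and return [v_0, ..., v_steps]."""
    orbit = [v0]
    for _ in range(steps):
        orbit.append(r * orbit[-1] * (1.0 - orbit[-1]))
    return orbit

def closed_form_r2(v0, j):
    """Explicit solution for r = 2."""
    return 0.5 - 0.5 * (1.0 - 2.0 * v0) ** (2 ** j)

def closed_form_r4(theta, j):
    """Explicit solution for r = 4 with v_0 = sin^2(pi * theta)."""
    return math.sin(2 ** j * math.pi * theta) ** 2

# r = 2: convergence to the fixed point 1/2.
v0 = 0.1
orbit2 = logistic_orbit(2.0, v0, 6)
err2 = max(abs(orbit2[j] - closed_form_r2(v0, j)) for j in range(7))

# r = 4: chaotic regime, compared over a short horizon only.
theta = 0.1
orbit4 = logistic_orbit(4.0, math.sin(math.pi * theta) ** 2, 10)
err4 = max(abs(orbit4[j] - closed_form_r4(theta, j)) for j in range(11))

print(err2, err4, orbit2[-1])
```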


Example 2.5 Turning now to maps $\Psi$ derived from differential equations, the simplest case is to consider linear autonomous dynamical systems of the form

$$\frac{dv}{dt} = Lv, \tag{2.14a}$$
$$v(0) = v_0. \tag{2.14b}$$

Then $\Psi(u) = Au$ with $A = \exp(L\tau)$. ♠

Example 2.6 The Lorenz ’63 model is perhaps the simplest continuous-time system to exhibit sensitivity to initial conditions and chaos. It is a system of three coupled non-linear ordinary differential equations whose solution $v \in \mathbb{R}^3$, where $v = (v_1, v_2, v_3)$, satisfies²

$$\frac{dv_1}{dt} = a (v_2 - v_1), \tag{2.15a}$$
$$\frac{dv_2}{dt} = -a v_1 - v_2 - v_1 v_3, \tag{2.15b}$$
$$\frac{dv_3}{dt} = v_1 v_2 - b v_3 - b (r + a). \tag{2.15c}$$

Note that we have employed a coordinate system where the origin in the original version of the equations proposed by Lorenz is shifted. In the coordinate system that we employ here we have equation (2.4) with vector field $f$ satisfying

$$\langle f(v), v \rangle \le \alpha - \beta |v|^2 \tag{2.16}$$

for some $\alpha, \beta > 0$. As demonstrated in Example 1.22 this implies the existence of an absorbing set:

$$\limsup_{t \to \infty} |v(t)|^2 < R \tag{2.17}$$

for any $R > \alpha / \beta$. Mapping the ball $B(0, R)$ forward under the dynamics gives the global attractor (see Definition 1.23) for the dynamics. In Figure 2.6 we visualize this attractor, projected onto two different pairs of coordinates at the classical parameter values $(a, b, r) = (10, \frac{8}{3}, 28)$.

Figure 2.6: Projection of the Lorenz ’63 attractor onto two different pairs of coordinates. (a) $v_1$ vs $v_2$ (b) $v_2$ vs $v_3$

Throughout these notes we will use the classical parameter values $(a, b, r) = (10, \frac{8}{3}, 28)$ in all of our numerical experiments; at these values the system is chaotic and exhibits sensitive dependence with respect to the initial condition. A trajectory of $v_1$ versus time can be found in Figure 2.7a and in Figure 2.7b we illustrate the evolution of a small perturbation to the initial condition which generated Figure 2.7a; to be explicit we plot the evolution of the error
in the Fuclidean norm \ ■ |, for an initial perturbation of magnitude 10“^. Figure\2f^ suggests 
that the measure $\mu_\infty$ is supported on a strange set with Lebesgue measure zero, and this is indeed the case; for this example there is no Lebesgue density $\rho_\infty$ for the invariant measure, reflecting the fact that the attractor has a fractal dimension less than three, the dimension of the space where the dynamical system lies.
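Both the dissipativity structure and the sensitive dependence on initial conditions can be explored with a few lines of code. The sketch below integrates the shifted Lorenz ’63 equations with a classical fourth-order Runge-Kutta scheme; the time step, the initial condition, the perturbation size $10^{-8}$ and the function names are illustrative assumptions and are not the settings used for the figures. It also checks the identity $\langle f(v), v \rangle = -a v_1^2 - v_2^2 - b v_3^2 - b(r + a) v_3$, which follows by direct calculation from the equations because the cubic terms cancel and which underlies a bound of the form (2.16).

```python
import math

A, B, R = 10.0, 8.0 / 3.0, 28.0   # classical parameter values (a, b, r)

def f(v):
    """Vector field of the Lorenz '63 model in the shifted coordinates."""
    v1, v2, v3 = v
    return (A * (v2 - v1),
            -A * v1 - v2 - v1 * v3,
            v1 * v2 - B * v3 - B * (R + A))

def rk4_step(v, dt):
    """One classical Runge-Kutta step of size dt."""
    def shifted(base, k, scale):
        return tuple(b + scale * ki for b, ki in zip(base, k))
    k1 = f(v)
    k2 = f(shifted(v, k1, 0.5 * dt))
    k3 = f(shifted(v, k2, 0.5 * dt))
    k4 = f(shifted(v, k3, dt))
    return tuple(vi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for vi, a, b, c, d in zip(v, k1, k2, k3, k4))

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def inner_f_v(v):
    """Inner product <f(v), v>."""
    return sum(fi * vi for fi, vi in zip(f(v), v))

# Check the closed form of <f(v), v> at an arbitrary point.
p = (1.5, -2.0, 0.7)
closed_form = -A * p[0] ** 2 - p[1] ** 2 - B * p[2] ** 2 - B * (R + A) * p[2]
identity_error = abs(inner_f_v(p) - closed_form)

# Two trajectories started a distance 1e-8 apart.
dt, steps = 0.01, 4000
v = (1.0, 1.0, -18.0)
w = (1.0 + 1e-8, 1.0, -18.0)
max_separation, max_norm = 0.0, 0.0
for _ in range(steps):
    v, w = rk4_step(v, dt), rk4_step(w, dt)
    max_separation = max(max_separation, norm(tuple(a - b for a, b in zip(v, w))))
    max_norm = max(max_norm, norm(v))

print(identity_error, max_separation, max_norm)
```

The trajectory stays bounded while the separation between the two solutions grows by many orders of magnitude, which is the qualitative behaviour described above.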

²Here the index denotes components of the solution, not discrete time.

Figure 2.7: Dynamics of the Lorenz ’63 model in the chaotic regime $(a, b, r) = (10, \frac{8}{3}, 28)$. (a) $v_1$ as a function of time (b) Evolution of error for a small perturbation

Figure 2.8: Dynamics of the Lorenz ’96 model in the chaotic regime $(F, K) = (8, 40)$.

♠

Example 2.7 The Lorenz ’96 model is a simple dynamical system, of tunable dimension, which was designed as a caricature of the dynamics of Rossby waves in atmospheric dynamics. The equations have a periodic “ring” formulation and take the form³

$$\frac{dv_k}{dt} = v_{k-1} (v_{k+1} - v_{k-2}) - v_k + F, \quad k \in \{1, \dots, K\}, \tag{2.18a}$$
$$v_0 = v_K, \quad v_{K+1} = v_1, \quad v_{-1} = v_{K-1}. \tag{2.18b}$$

Equation (2.18) satisfies the same dissipativity property (2.16) satisfied by the Lorenz ’63 model, for appropriate choice of $\alpha, \beta > 0$, and hence also satisfies the absorbing ball property (2.17), thus having a global attractor (see Definition 1.23).

³Again, here the index denotes components of the solution, not discrete time.

Figure 2.9: Projection of the Lorenz ’96 attractor onto two different pairs of coordinates. (a) $v_1$ vs $v_K$ (b) $v_1$ vs $v_{K-1}$

In Figure 2.8a we plot a trajectory of $v_1$ versus time for $F = 8$ and $K = 40$. Furthermore, as we did in the case of the Lorenz ’63 model, we also show the evolution of the Euclidean


norm of the error \ ■ \ for an initial perturbation of magnitude 10“^; this is displayed in Figure 
2.8b and clearly demonstrates sensitive dependence on initial conditions. We visualize the attractor, projected onto two different pairs of coordinates, in Figure 2.9.
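The periodic "ring" structure of the Lorenz ’96 vector field is conveniently expressed with modular indexing. The sketch below is one possible implementation, using zero-based indices in place of $k \in \{1, \dots, K\}$; the function name `lorenz96_rhs` and the test vector are illustrative assumptions. Two structural facts are checked: the constant state $v_k = F$ is an equilibrium, and the quadratic terms cancel in the inner product, so that $\langle f(v), v \rangle = -|v|^2 + F \sum_k v_k$, which is the mechanism behind the dissipativity property.

```python
import math

def lorenz96_rhs(v, F):
    """Right-hand side v_{k-1} (v_{k+1} - v_{k-2}) - v_k + F with periodic indices."""
    K = len(v)
    return [v[(k - 1) % K] * (v[(k + 1) % K] - v[(k - 2) % K]) - v[k] + F
            for k in range(K)]

K, F = 40, 8.0

# The constant state v_k = F is an equilibrium of the model.
rhs_at_equilibrium = lorenz96_rhs([F] * K, F)
equilibrium_residual = max(abs(x) for x in rhs_at_equilibrium)

# The quadratic terms sum to zero in <f(v), v> for any state v.
v = [math.sin(0.3 * k) + 0.1 * k for k in range(K)]
inner = sum(fi * vi for fi, vi in zip(lorenz96_rhs(v, F), v))
expected = -sum(vi * vi for vi in v) + F * sum(v)
energy_residual = abs(inner - expected)

print(equilibrium_residual, energy_residual)
```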


2.3 Smoothing Problem 

2.3.1. Probabilistic Formulation of Data Assimilation 

Together (2.1) and (2.2) provide a probabilistic model for the jointly varying random variable $(v, y)$. In the case of deterministic dynamics, (2.3) and (2.2) provide a probabilistic model for the jointly varying random variable $(v_0, y)$. Thus in both cases we have a random variable $(u, y)$, with $u = v$ (resp. $u = v_0$) in the stochastic (resp. deterministic) case. Our aim is to find out information about the signal $v$, in the stochastic case, or $v_0$ in the deterministic case, from observation of a single instance of the data $y$. The natural probabilistic approach to this problem is to try and find the probability measure describing the random variable $u$ given $y$, denoted $u | y$. This constitutes the Bayesian formulation of the problem of determining information about the signal arising in a noisy dynamical model, based on noisy observations of that signal. We will refer to the conditioned random variable $u | y$, in the case of either the stochastic dynamics or deterministic dynamics, as the smoothing distribution. It is a random variable which contains all the probabilistic information about the signal, given our observations. The key concept which drives this approach is Bayes’ formula from subsection 1.1.4, which we use repeatedly in what follows.

2.3.2. Stochastic Dynamics 

We wish to find the signal $v$ from (2.1) from a single instance of data $y$ given by (2.2). To be more precise we wish to condition the signal on a discrete time interval $\mathbb{J}_0 = \{0, \dots, J\}$, given data on the discrete time interval $\mathbb{J} = \{1, \dots, J\}$; we refer to $\mathbb{J}_0$ as the data assimilation window. We define $v = \{v_j\}_{j \in \mathbb{J}_0}$, $y = \{y_j\}_{j \in \mathbb{J}}$, $\xi = \{\xi_j\}_{j \in \mathbb{J}_0}$ and $\eta = \{\eta_j\}_{j \in \mathbb{J}}$. The smoothing distribution here is the distribution of the conditioned random variable $v | y$. Recall that we have assumed that $v_0$, $\xi$ and $\eta$ are mutually independent random variables. With this fact in hand we may apply Bayes’ formula to find the pdf $\mathbb{P}(v | y)$.

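Prior The prior on $v$ is specified by (2.1), together with the independence of $v_0$ and $\xi$ and the i.i.d. structure of $\xi$. First note that, using (O and the i.i.d. structure of $\xi$ in turn, we obtain

$$\mathbb{P}(v) = \mathbb{P}(v_J, v_{J-1}, \dots, v_0) = \mathbb{P}(v_J | v_{J-1}, \dots, v_0) \mathbb{P}(v_{J-1}, \dots, v_0) = \mathbb{P}(v_J | v_{J-1}) \mathbb{P}(v_{J-1}, \dots, v_0).$$

Proceeding inductively gives

$$\mathbb{P}(v) = \prod_{j=0}^{J-1} \mathbb{P}(v_{j+1} | v_j) \, \mathbb{P}(v_0).$$

Now

$$\mathbb{P}(v_0) \propto \exp\Bigl(-\frac{1}{2} \bigl|C_0^{-1/2} (v_0 - m_0)\bigr|^2\Bigr)$$

whilst

$$\mathbb{P}(v_{j+1} | v_j) \propto \exp\Bigl(-\frac{1}{2} \bigl|\Sigma^{-1/2} (v_{j+1} - \Psi(v_j))\bigr|^2\Bigr).$$

The probability distribution $\mathbb{P}(v)$ that we now write down is not Gaussian, but the distribution on the initial condition $\mathbb{P}(v_0)$, and the conditional distributions $\mathbb{P}(v_{j+1} | v_j)$, are all Gaussian, making the explicit calculations above straightforward.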

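Combining the preceding information we obtain

$$\mathbb{P}(v) \propto \exp(-\mathsf{J}(v))$$

where

$$\mathsf{J}(v) := \frac{1}{2} \bigl|C_0^{-1/2} (v_0 - m_0)\bigr|^2 + \sum_{j=0}^{J-1} \frac{1}{2} \bigl|\Sigma^{-1/2} (v_{j+1} - \Psi(v_j))\bigr|^2 \tag{2.19a}$$
$$= \frac{1}{2} |v_0 - m_0|_{C_0}^2 + \sum_{j=0}^{J-1} \frac{1}{2} |v_{j+1} - \Psi(v_j)|_{\Sigma}^2. \tag{2.19b}$$

The pdf $\mathbb{P}(v) = \rho_0(v)$ proportional to $\exp(-\mathsf{J}(v))$ determines a prior measure $\mu_0$ on $\mathbb{R}^{|\mathbb{J}_0| \times n}$. The fact that the probability is not, in general, Gaussian follows from the fact that $\Psi$ is not, in general, linear.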

Likelihood The likelihood of the data $y | v$ is determined as follows. It is a (Gaussian) probability distribution on $\mathbb{R}^{|\mathbb{J}| \times m}$, with pdf $\mathbb{P}(y | v)$ proportional to $\exp(-\Phi(v; y))$, where

$$\Phi(v; y) = \sum_{j=0}^{J-1} \frac{1}{2} |y_{j+1} - h(v_{j+1})|_{\Gamma}^2. \tag{2.20}$$

To see this note that, because of the i.i.d. nature of the sequence $\eta$, it follows that

$$\mathbb{P}(y | v) = \prod_{j=0}^{J-1} \mathbb{P}(y_{j+1} | v) = \prod_{j=0}^{J-1} \mathbb{P}(y_{j+1} | v_{j+1}) \propto \prod_{j=0}^{J-1} \exp\Bigl(-\frac{1}{2} \bigl|\Gamma^{-1/2} (y_{j+1} - h(v_{j+1}))\bigr|^2\Bigr) = \exp(-\Phi(v; y)).$$

In the applied literature $m_0$ and $C_0$ are often referred to as the background mean and background covariance respectively; we refer to $\Phi$ as the model-data misfit functional.

Using Bayes’ formula (1.7) we can combine the prior and the likelihood to determine the posterior distribution, that is the smoothing distribution, on $v | y$. We denote the measure with this distribution by $\mu$.

Theorem 2.8 The posterior smoothing distribution on $v | y$ for the stochastic dynamics model (2.1), (2.2) is a probability measure $\mu$ on $\mathbb{R}^{|\mathbb{J}_0| \times n}$ with pdf $\mathbb{P}(v | y) = \rho(v)$ proportional to $\exp(-\mathsf{I}(v; y))$ where

$$\mathsf{I}(v; y) = \mathsf{J}(v) + \Phi(v; y). \tag{2.21}$$

Proof Bayes’ formula (1.7) gives us

$$\mathbb{P}(v | y) = \frac{\mathbb{P}(y | v) \mathbb{P}(v)}{\mathbb{P}(y)}.$$

Thus, ignoring constants of proportionality which depend only on $y$,

$$\mathbb{P}(v | y) \propto \mathbb{P}(y | v) \mathbb{P}(v) \propto \exp(-\Phi(v; y)) \exp(-\mathsf{J}(v)) = \exp(-\mathsf{I}(v; y)).$$

□

Note that, although the preceding calculations required only knowledge of the pdfs of Gaussian distributions, the resulting posterior distribution is non-Gaussian in general, unless $\Psi$ and $h$ are linear. This is because, unless $\Psi$ and $h$ are linear, $\mathsf{I}(\cdot; y)$ is not quadratic. We refer to $\mathsf{I}$ as the negative log-posterior. It will be helpful later to note that

$$\frac{\rho(v)}{\rho_0(v)} \propto \exp(-\Phi(v; y)). \tag{2.22}$$
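The negative log-posterior is straightforward to evaluate once the signal and data are stored as sequences. The sketch below treats the scalar case $n = m = 1$, so that $|x|_C^2 = x^2 / C$; the function names `J_prior`, `Phi_misfit` and `I_posterior`, the storage convention `y[j]` for $y_{j+1}$ and the numerical values are illustrative assumptions.

```python
def J_prior(v, psi, m0, C0, Sigma):
    """Background term plus model-error terms, scalar state."""
    total = 0.5 * (v[0] - m0) ** 2 / C0
    for j in range(len(v) - 1):
        total += 0.5 * (v[j + 1] - psi(v[j])) ** 2 / Sigma
    return total

def Phi_misfit(v, y, h, Gamma):
    """Model-data misfit; y[j] stores the observation y_{j+1}."""
    return sum(0.5 * (y[j] - h(v[j + 1])) ** 2 / Gamma for j in range(len(y)))

def I_posterior(v, y, psi, h, m0, C0, Sigma, Gamma):
    """Negative log-posterior I(v; y) = J(v) + Phi(v; y), up to an additive constant."""
    return J_prior(v, psi, m0, C0, Sigma) + Phi_misfit(v, y, h, Gamma)

psi = lambda u: 0.5 * u        # linear dynamics, chosen for illustration
h = lambda u: u                # direct observation of the state
m0, C0, Sigma, Gamma = 0.0, 1.0, 0.1, 1.0

v = [1.0, 0.5, 0.25]           # a path consistent with the noise-free dynamics
y_exact = [0.5, 0.25]          # data that match the path exactly
y_off = [1.5, 0.25]            # first observation displaced by one unit

I_exact = I_posterior(v, y_exact, psi, h, m0, C0, Sigma, Gamma)
I_off = I_posterior(v, y_off, psi, h, m0, C0, Sigma, Gamma)
print(I_exact, I_off)
```

With data that match the path, only the background term contributes; displacing one observation adds the corresponding misfit term.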

2.3.3. Reformulation of Stochastic Dynamics 

For the development of algorithms to probe the posterior distribution, the following reformulation of the stochastic dynamics problem can be very useful. For this we define the vector $\xi = (v_0, \xi_0, \xi_1, \dots, \xi_{J-1}) \in \mathbb{R}^{|\mathbb{J}_0| \times n}$. The following lemma is key to what follows.

Lemma 2.9 Define the mapping $G : \mathbb{R}^{|\mathbb{J}_0| \times n} \to \mathbb{R}^{|\mathbb{J}_0| \times n}$ by

$$G_j(v_0, \xi_0, \xi_1, \dots, \xi_{J-1}) = v_j, \quad j = 0, \dots, J,$$

where $v_j$ is determined by (2.1). Then this mapping is invertible. Furthermore, if $\Psi \equiv 0$, then $G$ is the identity mapping.

Proof In words the mapping $G$ takes the initial condition and noise into the signal. Invertibility requires determination of the initial condition and the noise from the signal. From the signal we may compute the noise as follows, noting that, of course, the initial condition is specified and that then we have

$$\xi_j = v_{j+1} - \Psi(v_j), \quad j = 0, \dots, J - 1.$$

The fact that $G$ becomes the identity mapping when $\Psi \equiv 0$ follows directly from (2.1) by inspection. □
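The mapping $G$ and its inverse are simple to implement, which is what makes moving between the two formulations practical. The sketch below uses a scalar state and stores $\xi$ as the list $[v_0, \xi_0, \dots, \xi_{J-1}]$; the function names, the choice $\Psi(v) = 2.5 \sin v$ and the sample values are illustrative assumptions.

```python
import math

def G(xi, psi):
    """Map (v_0, xi_0, ..., xi_{J-1}) to (v_0, ..., v_J) via v_{j+1} = psi(v_j) + xi_j."""
    v = [xi[0]]
    for noise in xi[1:]:
        v.append(psi(v[-1]) + noise)
    return v

def G_inverse(v, psi):
    """Recover (v_0, xi_0, ..., xi_{J-1}) from the signal via xi_j = v_{j+1} - psi(v_j)."""
    xi = [v[0]]
    for j in range(len(v) - 1):
        xi.append(v[j + 1] - psi(v[j]))
    return xi

psi = lambda u: 2.5 * math.sin(u)
xi = [0.3, 0.1, -0.2, 0.05]            # initial condition followed by three noise values

v = G(xi, psi)
xi_back = G_inverse(v, psi)
round_trip_error = max(abs(a - b) for a, b in zip(xi, xi_back))

# With psi identically zero, G leaves its argument unchanged.
identity_case = G(xi, lambda u: 0.0)

print(v, round_trip_error, identity_case)
```

Applied samplewise, `G` converts draws of the initial condition and noise into draws of the signal, and `G_inverse` does the reverse.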
We may thus consider the smoothing problem as finding the probability distribution of $\xi$, as defined prior to the lemma, given data $y$, with $y$ as defined in section 2.3.2. Furthermore we have, using the notion of pushforward,

$$\mathbb{P}(v | y) = G \star \mathbb{P}(\xi | y), \quad \mathbb{P}(\xi | y) = G^{-1} \star \mathbb{P}(v | y). \tag{2.23}$$

These formulae mean that it is easy to move between the two measures: samples from one can be converted into samples from the other simply by applying $G$ or $G^{-1}$. This means that algorithms can be applied to, for example, generate samples from $\xi | y$, and then convert into samples from $v | y$. We will use this later on. In order to use this idea it will be helpful to have an explicit expression for the pdf of $\xi | y$. We now find such an expression.

To start we introduce the measure $\vartheta_0$ with density $\pi_0$ found from $\mu_0$ and $\rho_0$ in the case where $\Psi \equiv 0$. Thus

$$\pi_0(v) \propto \exp\Bigl(-\frac{1}{2} \bigl|C_0^{-1/2} (v_0 - m_0)\bigr|^2 - \sum_{j=0}^{J-1} \frac{1}{2} \bigl|\Sigma^{-1/2} v_{j+1}\bigr|^2\Bigr) \tag{2.24a}$$
$$\propto \exp\Bigl(-\frac{1}{2} |v_0 - m_0|_{C_0}^2 - \sum_{j=0}^{J-1} \frac{1}{2} |v_{j+1}|_{\Sigma}^2\Bigr) \tag{2.24b}$$

and hence $\vartheta_0$ is a Gaussian measure, independent in each component $v_j$ for $j = 0, \dots, J$. By Lemma 2.9 we also deduce that the measure $\vartheta_0$ with density $\pi_0$ is the prior on $\xi$ as defined above:

$$\pi_0(\xi) \propto \exp\Bigl(-\frac{1}{2} |v_0 - m_0|_{C_0}^2 - \sum_{j=0}^{J-1} \frac{1}{2} |\xi_j|_{\Sigma}^2\Bigr). \tag{2.25}$$

We now compute the likelihood of $y | \xi$. For this we define

$$\mathcal{G}_j(\xi) = h(G_j(\xi)) \tag{2.26}$$

and note that we may then concatenate the data and write

$$y = \mathcal{G}(\xi) + \eta \tag{2.27}$$

where $\eta = (\eta_1, \dots, \eta_J)$ is the Gaussian random variable $N(0, \Gamma_J)$ and $\Gamma_J$ is a block diagonal $mJ \times mJ$ matrix with $m \times m$ diagonal blocks $\Gamma$. It follows that the likelihood is determined by $\mathbb{P}(y | \xi) = N(\mathcal{G}(\xi), \Gamma_J)$. Applying Bayes’ formula from (1.7) to find the pdf for $\xi | y$ we find the posterior $\vartheta$ on $\xi | y$, as summarized in the following theorem.


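Theorem 2.10 The posterior smoothing distribution on $\xi | y$ for the stochastic dynamics model (2.1), (2.2) is a probability measure $\vartheta$ on $\mathbb{R}^{|\mathbb{J}_0| \times n}$ with pdf $\mathbb{P}(\xi | y) = \pi(\xi)$ proportional to $\exp(-\mathsf{I}_r(\xi; y))$ where

$$\mathsf{I}_r(\xi; y) = \mathsf{J}_r(\xi) + \Phi_r(\xi; y), \tag{2.28}$$

$$\Phi_r(\xi; y) := \frac{1}{2} |y - \mathcal{G}(\xi)|_{\Gamma_J}^2$$

and

$$\mathsf{J}_r(\xi) := \frac{1}{2} |v_0 - m_0|_{C_0}^2 + \sum_{j=0}^{J-1} \frac{1}{2} |\xi_j|_{\Sigma}^2.$$

We refer to $\mathsf{I}_r$ as the negative log-posterior.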
2.3.4. Deterministic Dynamics

It is also of interest to study the posterior distribution on the initial condition in the case where the model dynamics contains no noise, and is given by (2.3); this we now do. Recall that $\Psi^{(j)}(\cdot)$ denotes the $j$-fold composition of $\Psi(\cdot)$ with itself. In the following we sometimes refer to $\mathsf{J}_{\mathrm{det}}$ as the background penalization, and $m_0$ and $C_0$ as the background mean and covariance; we refer to $\Phi_{\mathrm{det}}$ as the model-data misfit functional.

Theorem 2.11 The posterior smoothing distribution on $v_0 | y$ for the deterministic dynamics model (2.3), (2.2) is a probability measure $\nu$ on $\mathbb{R}^n$ with density $\mathbb{P}(v_0 | y) = \varrho(v_0)$ proportional to $\exp(-\mathsf{I}_{\mathrm{det}}(v_0; y))$ where

$$\mathsf{I}_{\mathrm{det}}(v_0; y) = \mathsf{J}_{\mathrm{det}}(v_0) + \Phi_{\mathrm{det}}(v_0; y), \tag{2.29a}$$
$$\mathsf{J}_{\mathrm{det}}(v_0) = \frac{1}{2} |v_0 - m_0|_{C_0}^2, \tag{2.29b}$$
$$\Phi_{\mathrm{det}}(v_0; y) = \sum_{j=0}^{J-1} \frac{1}{2} \bigl|y_{j+1} - h\bigl(\Psi^{(j+1)}(v_0)\bigr)\bigr|_{\Gamma}^2. \tag{2.29c}$$

Proof We again use Bayes’ rule which states that

$$\mathbb{P}(v_0 | y) = \frac{\mathbb{P}(y | v_0) \mathbb{P}(v_0)}{\mathbb{P}(y)}.$$

Thus, ignoring constants of proportionality which depend only on $y$,

$$\mathbb{P}(v_0 | y) \propto \mathbb{P}(y | v_0) \mathbb{P}(v_0) \propto \exp(-\Phi_{\mathrm{det}}(v_0; y)) \exp\Bigl(-\frac{1}{2} |v_0 - m_0|_{C_0}^2\Bigr) = \exp(-\mathsf{I}_{\mathrm{det}}(v_0; y)).$$

Here we have used the fact that $\mathbb{P}(y | v_0)$ is proportional to $\exp(-\Phi_{\mathrm{det}}(v_0; y))$; this follows from the fact that the $y_j | v_0$ form a sequence of independent Gaussian random variables $N(h(v_j), \Gamma)$ with $v_j = \Psi^{(j)}(v_0)$. □

We refer to $\mathsf{I}_{\mathrm{det}}$ as the negative log-posterior.

2.4 Filtering Problem 

The smoothing problem considered in the previous section involves, potentially, conditioning $v_j$ on data $y_k$ with $k > j$. Such conditioning can only be performed off-line and is of no use in on-line scenarios where we want to determine information on the state of the signal now, hence using only data from the past up to the present. To study this situation, let $Y_j = \{y_\ell\}_{\ell=1}^{j}$ denote the accumulated data up to time $j$. Filtering is concerned with determining $\mathbb{P}(v_j | Y_j)$, the pdf associated with the probability measure on the random variable $v_j | Y_j$; in particular filtering is concerned with the sequential updating of this pdf as the index $j$ is incremented. This update is defined by the following procedure which provides a prescription for computing $\mathbb{P}(v_{j+1} | Y_{j+1})$ from $\mathbb{P}(v_j | Y_j)$ via two steps: prediction, which computes the mapping $\mathbb{P}(v_j | Y_j) \mapsto \mathbb{P}(v_{j+1} | Y_j)$, and analysis, which computes $\mathbb{P}(v_{j+1} | Y_j) \mapsto \mathbb{P}(v_{j+1} | Y_{j+1})$ by application of Bayes’ formula.
Prediction Note that $\mathbb{P}(v_{j+1} | Y_j, v_j) = \mathbb{P}(v_{j+1} | v_j)$ because $Y_j$ contains noisy and indirect information about $v_j$ and cannot improve upon perfect knowledge of the variable $v_j$. Thus, by dm), we deduce that

$$\mathbb{P}(v_{j+1} | Y_j) = \int_{\mathbb{R}^n} \mathbb{P}(v_{j+1} | Y_j, v_j) \mathbb{P}(v_j | Y_j) \, dv_j \tag{2.30a}$$
$$= \int_{\mathbb{R}^n} \mathbb{P}(v_{j+1} | v_j) \mathbb{P}(v_j | Y_j) \, dv_j. \tag{2.30b}$$

Note that, since the forward model equation (2.1) determines $\mathbb{P}(v_{j+1} | v_j)$, this prediction step provides the map from $\mathbb{P}(v_j | Y_j)$ to $\mathbb{P}(v_{j+1} | Y_j)$. This prediction step simplifies in the case of deterministic dynamics (2.3); in this case it simply corresponds to computing the pushforward of $\mathbb{P}(v_j | Y_j)$ under the map $\Psi$.

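Analysis Note that $\mathbb{P}(y_{j+1}|v_{j+1}, Y_j) = \mathbb{P}(y_{j+1}|v_{j+1})$ because $Y_j$ contains noisy and indirect information about $v_j$ and cannot improve upon perfect knowledge of the variable $v_{j+1}$. Thus, using Bayes' formula, we deduce that

$$\mathbb{P}(v_{j+1}|Y_{j+1}) = \mathbb{P}(v_{j+1}|Y_j, y_{j+1}) = \frac{\mathbb{P}(y_{j+1}|v_{j+1}, Y_j)\,\mathbb{P}(v_{j+1}|Y_j)}{\mathbb{P}(y_{j+1}|Y_j)} = \frac{\mathbb{P}(y_{j+1}|v_{j+1})\,\mathbb{P}(v_{j+1}|Y_j)}{\mathbb{P}(y_{j+1}|Y_j)}. \tag{2.31}$$

Since the observation equation (2.2) determines $\mathbb{P}(y_{j+1}|v_{j+1})$, this analysis step provides a map from $\mathbb{P}(v_{j+1}|Y_j)$ to $\mathbb{P}(v_{j+1}|Y_{j+1})$.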

Filtering Update Together, then, the prediction and analysis steps provide a mapping from $\mathbb{P}(v_j|Y_j)$ to $\mathbb{P}(v_{j+1}|Y_{j+1})$. Indeed if we let $\mu_j$ denote the probability measure on $\mathbb{R}^n$ corresponding to the density $\mathbb{P}(v_j|Y_j)$ and $\hat{\mu}_{j+1}$ be the probability measure on $\mathbb{R}^n$ corresponding to the density $\mathbb{P}(v_{j+1}|Y_j)$ then the prediction step maps $\mu_j$ to $\hat{\mu}_{j+1}$ whilst the analysis step maps $\hat{\mu}_{j+1}$ to $\mu_{j+1}$. However there is, in general, no easily usable closed form expression for the density of $\mu_j$, namely $\mathbb{P}(v_j|Y_j)$. Nevertheless, formulae (2.30), (2.31) form the starting point for numerous algorithms to approximate $\mathbb{P}(v_j|Y_j)$. In terms of analyzing the particle filter it is helpful conceptually to write the prediction and analysis steps as

$$\hat{\mu}_{j+1} = P\mu_j, \qquad \mu_{j+1} = L_j\hat{\mu}_{j+1}. \tag{2.32}$$

Note that $P$ does not depend on $j$ as the same Markov process governs the prediction step at each $j$; however $L_j$ depends on $j$ because the likelihood sees different data at each $j$. Furthermore, the formula $\hat{\mu}_{j+1} = P\mu_j$ summarizes (2.30) whilst $\mu_{j+1} = L_j\hat{\mu}_{j+1}$ summarizes (2.31). Note that $P$ is a linear mapping, whilst $L_j$ is nonlinear; this issue is discussed in subsections 1.4.1 and 1.4.3 at the level of pdfs.
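The two steps can be sketched numerically for a scalar state by representing densities on a uniform grid. This is only an illustration under stated assumptions: the dynamics noise and observation noise are taken to be scalar Gaussians with variances `sigma2` and `gamma2`, integrals are replaced by Riemann sums, and the function names `predict` and `analyse` are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x (scalar case)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def predict(grid, density, psi, sigma2):
    """Prediction step, cf. (2.30b), on a uniform grid.

    Assumes v_{j+1} = psi(v_j) + xi_j with xi_j ~ N(0, sigma2), so that
    P(v_{j+1} | v_j) is the Gaussian density N(psi(v_j), sigma2).
    """
    dv = grid[1] - grid[0]
    return [
        sum(gaussian_pdf(v_next, psi(v), sigma2) * p for v, p in zip(grid, density)) * dv
        for v_next in grid
    ]


def analyse(grid, predicted, y_next, h, gamma2):
    """Analysis step, cf. (2.31): multiply by the likelihood, then renormalize."""
    dv = grid[1] - grid[0]
    unnormalized = [gaussian_pdf(y_next, h(v), gamma2) * p for v, p in zip(grid, predicted)]
    evidence = sum(unnormalized) * dv  # Riemann-sum approximation of P(y_{j+1} | Y_j)
    return [u / evidence for u in unnormalized]
```

In this sketch `predict` acts linearly on the input density, whereas `analyse` divides by a normalization that itself depends on the input, which is the linear versus nonlinear distinction between $P$ and $L_j$ noted above.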


2.5 Filtering and Smoothing are Related 

The filtering and smoothing approaches to determining the signal from the data are distinct, 
but related. They are related by the fact that in both cases the solution computed at the end 
of any specified time-interval is conditioned on the same data, and must hence coincide; this 
is made precise in the following. 
Theorem 2.12 Let $\mathbb{P}(v|y)$ denote the smoothing distribution on the discrete time interval $j \in \mathbb{J}_0$, and $\mathbb{P}(v_J|Y_J)$ the filtering distribution at time $j = J$ for the stochastic dynamics model (2.1). Then the marginal of the smoothing distribution on $v_J$ is the same as the filtering distribution at time $J$:

$$\int \mathbb{P}(v|y)\,dv_0\,dv_1\cdots dv_{J-1} = \mathbb{P}(v_J|Y_J).$$

Proof Note that $y = Y_J$. Since $v = (v_0,\ldots,v_{J-1},v_J)$ the result follows trivially. □

Remark 2.13 Note that the marginal of the smoothing distribution on, say, $v_j$ with $j < J$ is not equal to the filter $\mathbb{P}(v_j|Y_j)$. This is because the smoother induces a distribution on $v_j$ which is influenced by the entire data set $Y_J = y = \{y_l\}_{l\in\mathbb{J}}$; in contrast the filter at $j$ involves only the data $Y_j = \{y_l\}_{l=1}^{j}$. ♠

It is also interesting to mention the relationship between filtering and smoothing in the case of noise-free dynamics. In this case the filtering distribution $\mathbb{P}(v_j|Y_j)$ is simply found as the pushforward of the smoothing distribution $\mathbb{P}(v_0|Y_j)$ under $\Psi^{(j)}$, that is under $j$ applications of $\Psi$.

Theorem 2.14 Let $\mathbb{P}(v_0|y)$ denote the smoothing distribution on the discrete time interval $j \in \mathbb{J}_0$, and $\mathbb{P}(v_J|Y_J)$ the filtering distribution at time $j = J$ for the deterministic dynamics model (2.3). Then the pushforward of the smoothing distribution on $v_0$ under $\Psi^{(J)}$ is the same as the filtering distribution at time $J$:

$$\Psi^{(J)} \star \mathbb{P}(v_0|Y_J) = \mathbb{P}(v_J|Y_J).$$
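The statement of Theorem 2.14 can be checked by hand in a scalar linear Gaussian setting. The sketch below assumes the deterministic dynamics $v_{j+1} = \lambda v_j$, observations $y_{j+1} = v_{j+1} + \eta_{j+1}$ with $\eta_{j+1} \sim N(0,\gamma^2)$ and a Gaussian prior $N(m_0, C_0)$; under these assumptions both the smoother on $v_0$ and the filter are Gaussian, so each is described by a mean and a variance. The function names are illustrative.

```python
def smoother_v0(y, lam, m0, c0, gamma2):
    """Mean and variance of the Gaussian posterior on v0 given y = [y_1, ..., y_J]."""
    precision = 1.0 / c0 + sum(lam ** (2 * (j + 1)) for j in range(len(y))) / gamma2
    mean = (m0 / c0 + sum(lam ** (j + 1) * yj for j, yj in enumerate(y)) / gamma2) / precision
    return mean, 1.0 / precision


def filter_vJ(y, lam, m0, c0, gamma2):
    """Mean and variance of the filter at time J by sequential prediction and analysis."""
    mean, var = m0, c0
    for y_next in y:
        mean, var = lam * mean, lam * lam * var          # prediction: pushforward under v -> lam * v
        gain = var / (var + gamma2)                      # analysis: scalar Gaussian Bayes update
        mean, var = mean + gain * (y_next - mean), (1.0 - gain) * var
    return mean, var


def pushforward(mean, var, lam, steps):
    """Pushforward of N(mean, var) under `steps` applications of v -> lam * v."""
    return lam ** steps * mean, lam ** (2 * steps) * var
```

Pushing the output of `smoother_v0` forward by $J$ steps should reproduce the output of `filter_vJ` up to rounding error.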


2.6 Well-Posedness 


Well-posedness of a mathematical problem refers, generally, to the existence of a unique solution which depends continuously on the parameters defining the problem. We have shown, for both filtering and smoothing, how to construct a uniquely defined probabilistic solution to the problem of determining the signal given the data. In this setting it is natural to consider well-posedness with respect to the data itself. Thus we now investigate the continuous dependence of the probabilistic solution on the observed data; indeed we will show Lipschitz dependence. To this end we need probability metrics, as introduced in section 1.3.

As we do throughout the notes, we perform all calculations using the existence of everywhere positive Lebesgue densities for our measures. We let $\mu_0$ denote the prior measure on $v$ for the smoothing problem arising in stochastic dynamics, as defined by (2.1). Then $\mu$ and $\mu'$ denote the posterior measures resulting from two different instances of the data, $y$ and $y'$ respectively. Let $\rho_0$, $\rho$ and $\rho'$ denote the Lebesgue densities of $\mu_0$, $\mu$ and $\mu'$ respectively. Then, for $J$ and $\Phi$ as defined in (2.19) and (2.20),

$$\rho_0(v) = \frac{1}{Z_0}\exp\big(-J(v)\big), \tag{2.33a}$$

$$\rho(v) = \frac{1}{Z}\exp\big(-J(v) - \Phi(v;y)\big), \tag{2.33b}$$

$$\rho'(v) = \frac{1}{Z'}\exp\big(-J(v) - \Phi(v;y')\big), \tag{2.33c}$$
where

$$Z_0 = \int \exp\big(-J(v)\big)\,dv, \tag{2.34a}$$

$$Z = \int \exp\big(-J(v) - \Phi(v;y)\big)\,dv, \tag{2.34b}$$

$$Z' = \int \exp\big(-J(v) - \Phi(v;y')\big)\,dv. \tag{2.34c}$$

Here, and in the proofs that follow in this section, all integrals are over $\mathbb{R}^{|\mathbb{J}_0|\times n}$ (or, in the case of the deterministic dynamics model at the end of the section, over $\mathbb{R}^n$). Note that $|\mathbb{J}_0|$ is the cardinality of the set $\mathbb{J}_0$ and is hence equal to $J + 1$. To this end we note explicitly that (2.33a) implies that

$$\exp\big(-J(v)\big)\,dv = Z_0\rho_0(v)\,dv = Z_0\mu_0(dv), \tag{2.35}$$

indicating that integrals weighted by $\exp(-J(v))$ may be rewritten as expectations with respect to $\mu_0$. We use the identities (2.33), (2.34) and (2.35) repeatedly in what follows to express all integrals as expectations with respect to the measure $\mu_0$. In particular the assumptions that we make for the subsequent theorems and corollaries in this section are all expressed in terms of expectations under $\mu_0$ (or, under $\nu_0$ for the deterministic dynamics problem considered at the end of the section). This is convenient because it relates to the unconditioned problem of stochastic dynamics for $v$, in the absence of any data, and may thus be checked once and for all, independently of the particular data set $y$ or $y'$ which is used to condition $v$ and obtain $\mu$ and $\mu'$.

We assume throughout what follows that $y, y'$ are both contained in a ball of radius $r$ in the Euclidean norm on $\mathbb{R}^{|\mathbb{J}|\times m}$. Again $|\mathbb{J}|$ is the cardinality of the set $\mathbb{J}$ and is hence equal to $J$. We also note that $Z_0$ is bounded from above independently of $r$, because $\rho_0$ is the density associated with the probability measure $\mu_0$, which is therefore normalizable, and this measure is independent of the data. It also follows that $Z \le Z_0$, $Z' \le Z_0$ by using (2.35) in (2.34b), (2.34c), together with the fact that $\Phi(\cdot;y)$ is a positive function. Furthermore, if we assume that

$$\mathsf{v} := \sum_{j\in\mathbb{J}}\big(1 + |h(v_j)|^2\big) \tag{2.36}$$

satisfies $\mathbb{E}^{\mu_0}\mathsf{v} < \infty$, then both $Z$ and $Z'$ are positive with common lower bound depending only on $r$, as we now demonstrate. It is sufficient to prove the result for $Z$, which we now do. In the following, and in the proofs which follow, $K$ denotes a generic constant, which may depend on $r$ and $J$ but not on the solution sequence $v$, and which may change from instance to instance. Note first that, by (2.34), (2.35),

$$\frac{Z}{Z_0} = \int \exp\big(-\Phi(v;y)\big)\rho_0(v)\,dv \ge \int \exp(-K\mathsf{v})\rho_0(v)\,dv.$$

Since $\mathbb{E}^{\mu_0}\mathsf{v} < \infty$ we deduce from the Markov inequality that, for $R$ sufficiently large,

$$\frac{Z}{Z_0} \ge \exp(-KR)\int_{|\mathsf{v}|<R}\rho_0(v)\,dv = \exp(-KR)\,\mathbb{P}^{\mu_0}(|\mathsf{v}| < R) \ge \exp(-KR)\big(1 - R^{-1}\mathbb{E}^{\mu_0}\mathsf{v}\big).$$

Since $K$ depends on $y, y'$ only through $r$, we deduce that, by choice of $R$ sufficiently large, we have found lower bounds on $Z, Z'$ which depend on $y, y'$ only through $r$.
Finally we note that, since all norms are equivalent on finite dimensional spaces, there is a constant $K$ such that

$$\Big(\sum_{j=0}^{J-1}|y_{j+1} - y'_{j+1}|^2_{\Gamma}\Big)^{1/2} \le K|y - y'|. \tag{2.37}$$

The following theorem then shows that the posterior measure is in fact Lipschitz continuous, in the Hellinger metric, with respect to the data.


Theorem 2.15 Consider the smoothing problem arising from the stochastic dynamics model (2.1), resulting in the posterior probability distributions $\mu$ and $\mu'$ associated with two different data sets $y$ and $y'$. Assume that $\mathbb{E}^{\mu_0}\mathsf{v} < \infty$ where $\mathsf{v}$ is given by (2.36). Then there exists $c = c(r)$ such that, for all $|y|, |y'| < r$,

$$d_{\rm Hell}(\mu,\mu') \le c|y - y'|.$$


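Proof We have, by (2.33), (2.35),

$$d_{\rm Hell}(\mu,\mu')^2 = \frac12\int\big|\sqrt{\rho(v)} - \sqrt{\rho'(v)}\big|^2\,dv = \frac12 Z_0\int\Big|\frac{1}{\sqrt{Z}}e^{-\frac12\Phi(v;y)} - \frac{1}{\sqrt{Z'}}e^{-\frac12\Phi(v;y')}\Big|^2\rho_0(v)\,dv \le I_1 + I_2,$$

where

$$I_1 = Z_0\int\frac{1}{Z}\Big|e^{-\frac12\Phi(v;y)} - e^{-\frac12\Phi(v;y')}\Big|^2\rho_0(v)\,dv$$

and, using (2.33) and (2.34),

$$I_2 = Z_0\Big|\frac{1}{\sqrt{Z}} - \frac{1}{\sqrt{Z'}}\Big|^2\int e^{-\Phi(v;y')}\rho_0(v)\,dv = Z'\Big|\frac{1}{\sqrt{Z}} - \frac{1}{\sqrt{Z'}}\Big|^2.$$

We estimate $I_2$ first. Since, as shown before the theorem, $Z, Z'$ are bounded below by a positive constant depending only on $r$, we have

$$I_2 = \frac{1}{Z}\big|\sqrt{Z} - \sqrt{Z'}\big|^2 = \frac{1}{Z}\frac{|Z - Z'|^2}{|\sqrt{Z} + \sqrt{Z'}|^2} \le K|Z - Z'|^2.$$

As $\Phi(v;y) \ge 0$ and $\Phi(v;y') \ge 0$ we have from (2.33), (2.34), using the fact that $e^{-x}$ is Lipschitz on $\mathbb{R}^+$,

$$|Z - Z'| \le Z_0\int\big|e^{-\Phi(v;y)} - e^{-\Phi(v;y')}\big|\rho_0(v)\,dv \le Z_0\int|\Phi(v;y) - \Phi(v;y')|\rho_0(v)\,dv.$$

By definition of $\Phi$ and use of (2.37)

$$|\Phi(v;y) - \Phi(v;y')| \le \frac12\sum_{j=0}^{J-1}|y_{j+1} - y'_{j+1}|_{\Gamma}\,|y_{j+1} + y'_{j+1} - 2h(v_{j+1})|_{\Gamma}$$

$$\le \frac12\Big(\sum_{j=0}^{J-1}|y_{j+1} - y'_{j+1}|^2_{\Gamma}\Big)^{1/2}\Big(\sum_{j=0}^{J-1}|y_{j+1} + y'_{j+1} - 2h(v_{j+1})|^2_{\Gamma}\Big)^{1/2}$$

$$\le K|y - y'|\Big(\sum_{j=0}^{J-1}\big(1 + |h(v_{j+1})|^2\big)\Big)^{1/2} = K|y - y'|\,\mathsf{v}^{1/2}.$$

Since $\mathbb{E}^{\mu_0}\mathsf{v} < \infty$ implies that $\mathbb{E}^{\mu_0}\mathsf{v}^{1/2} < \infty$ it follows that

$$|Z - Z'| \le K|y - y'|.$$

Hence $I_2 \le K|y - y'|^2$.

Now, using that $Z_0$ is bounded above independently of $r$, that $Z$ is bounded below, depending on data only through $r$, and that $e^{-x/2}$ is Lipschitz on $\mathbb{R}^+$, it follows that $I_1$ satisfies

$$I_1 \le K\int|\Phi(v;y) - \Phi(v;y')|^2\rho_0(v)\,dv.$$

Squaring the preceding bound on $|\Phi(v;y) - \Phi(v;y')|$ gives

$$|\Phi(v;y) - \Phi(v;y')|^2 \le K|y - y'|^2\,\mathsf{v}$$

and so $I_1 \le K|y - y'|^2$ as required. □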
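The Lipschitz dependence on the data can be probed numerically in a toy problem. The sketch below is an illustration under simplifying assumptions that are not in the text: a scalar unknown with prior $N(m_0, C_0)$, a single direct observation with noise variance `gamma2`, densities tabulated on a uniform grid, and the Hellinger distance taken with the factor $\frac12$ used in the proof above. The function names are hypothetical.

```python
import math


def posterior_on_grid(grid, y, m0=0.0, c0=1.0, gamma2=0.5):
    """Normalized posterior density for one scalar observation y = v + noise."""
    dv = grid[1] - grid[0]
    weights = [math.exp(-0.5 * (v - m0) ** 2 / c0 - 0.5 * (y - v) ** 2 / gamma2) for v in grid]
    z = sum(weights) * dv                      # Riemann-sum normalization constant
    return [w / z for w in weights]


def hellinger(grid, p, q):
    """Hellinger distance, d^2 = (1/2) * integral of (sqrt(p) - sqrt(q))^2."""
    dv = grid[1] - grid[0]
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * total * dv)
```

Halving the perturbation of the data should roughly halve the computed distance, which is the behaviour a Lipschitz bound of the form $c|y - y'|$ leads one to expect for small perturbations.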

Corollary 2.16 Consider the smoothing problem arising from the stochastic dynamics model (2.1), resulting in the posterior probability distributions $\mu$ and $\mu'$ associated with two different data sets $y$ and $y'$. Assume that $\mathbb{E}^{\mu_0}\mathsf{v} < \infty$ where $\mathsf{v}$ is given by (2.36). Let $f : \mathbb{R}^{|\mathbb{J}_0|\times n} \to \mathbb{R}^p$ be such that $\mathbb{E}^{\mu_0}|f(v)|^2 < \infty$. Then there is $c = c(r) > 0$ such that, for all $|y|, |y'| < r$,

$$|\mathbb{E}^{\mu}f(v) - \mathbb{E}^{\mu'}f(v)| \le c|y - y'|.$$

Proof First note that, since $\Phi(v;y) \ge 0$, $Z_0$ is bounded above independently of $r$, and since $Z$ is bounded from below depending only on $r$, $\mathbb{E}^{\mu}|f(v)|^2 \le c\,\mathbb{E}^{\mu_0}|f(v)|^2$; and a similar bound holds under $\mu'$. The result follows from (1.13) and Theorem 2.15. □

Using the relationship between filtering and smoothing as described in the previous section, we may derive a corollary concerning the filtering distribution.

Corollary 2.17 Consider the smoothing problem arising from the stochastic dynamics model (2.1), resulting in the posterior probability distributions $\mu$ and $\mu'$ associated with two different data sets $y$ and $y'$. Assume that $\mathbb{E}^{\mu_0}\mathsf{v} < \infty$ where $\mathsf{v}$ is given by (2.36). Let $g : \mathbb{R}^n \to \mathbb{R}^p$ be such that $\mathbb{E}^{\mu_0}|g(v_J)|^2 < \infty$. Then there is $c = c(r) > 0$ such that, for all $|y|, |y'| < r$,

$$|\mathbb{E}^{\mu_J}g(u) - \mathbb{E}^{\mu'_J}g(u)| \le c|Y_J - Y'_J|,$$

where $\mu_J$ and $\mu'_J$ denote the filtering distributions at time $J$ corresponding to data $Y_J, Y'_J$ respectively (i.e. the marginals of $\mu$ and $\mu'$ on the coordinate at time $J$).

Proof Since, by Theorem 2.12, $\mu_J$ is the marginal of the smoother on the $v_J$ coordinate, the result follows from Corollary 2.16 by choosing $f(v) = g(v_J)$. □

A similar theorem, and corollaries, may be proved for the case of deterministic dynamics (2.3), and the posterior $\mathbb{P}(v_0|y)$. We state the theorem and leave its proof to the reader. We let $\nu_0$ denote the prior Gaussian measure $N(m_0, C_0)$ on $v_0$ for the smoothing problem arising in deterministic dynamics, and $\nu$ and $\nu'$ the posterior measures on $v_0$ resulting from two different instances of the data, $y$ and $y'$ respectively. We also define

$$\mathsf{v}_0 := \sum_{j=0}^{J-1}\Big(1 + \big|h\big(\Psi^{(j+1)}(v_0)\big)\big|^2\Big).$$

Theorem 2.18 Consider the smoothing problem arising from the deterministic dynamics model (2.3). Assume that $\mathbb{E}^{\nu_0}\mathsf{v}_0 < \infty$. Then there is $c = c(r) > 0$ such that, for all $|y|, |y'| < r$,

$$d_{\rm Hell}(\nu,\nu') \le c|y - y'|.$$

2.7 Assessing The Quality of Data Assimilation Algorithms

It is helpful when studying algorithms for data assimilation to ask two questions: (i) how informative is the data we have? (ii) how good is our algorithm at extracting this information? These are two separate questions, answers to both of which are required in the quest to understand how well we perform at extracting a signal, using model and data. We take the two questions separately, in turn; however we caution that many applied papers entangle them both and simply measure algorithm quality by ability to reconstruct the signal.

Answering question (i) is independent of any particular algorithm: it concerns the properties of the Bayesian posterior pdf itself. In some cases we will be interested in studying the properties of the probability distribution on the signal, or the initial condition, for a particular instance of the data generated from a particular instance of the signal, which we call the truth. In this context we will use the notation $y^\dagger = \{y^\dagger_j\}$ to denote the realization of the data generated from a particular realization of the truth $v^\dagger = \{v^\dagger_j\}$. We first discuss properties of the smoothing problem for stochastic dynamics. Posterior consistency concerns the question of the limiting behaviour of $\mathbb{P}(v|y^\dagger)$ as either $J \to \infty$ (large data sets) or $|\Gamma| \to 0$ (small noise). A key question is whether $\mathbb{P}(v|y^\dagger)$ converges to the truth in either of these limits; this might happen, for example, if $\mathbb{P}(v|y^\dagger)$ becomes closer and closer to a Dirac probability measure centred on $v^\dagger$. When this occurs we say that the problem exhibits Bayesian posterior consistency; it is then of interest to study the rate at which the limit is attained. Such questions concern the information content of the data; they do not refer to any algorithm and therefore they are not concerned with the quality of any particular algorithm. When considering filtering, rather than smoothing, a particular instance of this question concerns marginal distributions: for example one may be concerned with posterior consistency of $\mathbb{P}(v_J|y^\dagger_J)$ with respect to a Dirac on $v^\dagger_J$ in the filtering case, see Theorem 2.12; for the case of deterministic dynamics the distribution $\mathbb{P}(v|y^\dagger)$ is completely determined by $\mathbb{P}(v_0|y^\dagger)$ (see Theorem 2.14) so one may discuss posterior consistency of $\mathbb{P}(v_0|y^\dagger)$ with respect to a Dirac on $v^\dagger_0$.

Here it is appropriate to mention the important concept of model error. In many (in fact most) applications the physical system which generates the data set $\{y_j\}$ can be (sometimes significantly) different from the mathematical model used, at least in certain aspects. This can be thought of conceptually by imagining data generated by (2.2), with $v^\dagger = \{v^\dagger_j\}$ governed by the deterministic dynamics

$$v^\dagger_{j+1} = \Psi_{\rm true}(v^\dagger_j), \quad j \in \mathbb{Z}^+, \tag{2.38a}$$

$$v^\dagger_0 = u \sim N(m_0, C_0). \tag{2.38b}$$

Here the function $\Psi_{\rm true}$ governs the dynamics of the truth which underlies the data. We assume that the true solution operator is not known to us exactly, and seek instead to combine the data with the stochastic dynamics model (2.1); the noise is used to allow for the discrepancy between the true solution operator $\Psi_{\rm true}$ and that used in our model, namely $\Psi$. It is possible to think of many variants on this situation. For example, the dynamics of the truth may be stochastic; or the dynamics of the truth may take place in a higher-dimensional space than that used in our models, and may need to be projected into the model space. Statisticians sometimes refer to the situation where the data source differs from the model used as model misspecification.

We now turn from the information content, or quality, of the data to the quality of algorithms for data assimilation. We discuss three approaches to assessing quality. The first, fully Bayesian, approach can be defined independently of the quality of the data. The second, estimation, approach entangles the properties of the algorithm with the quality of the data. We discuss these two approaches in the context of the smoothing problem for stochastic dynamics. The reader will easily see how to generalize to smoothing for deterministic dynamics, or to filtering. The third approach is widely used in operational numerical weather prediction and judges quality by the ability to predict.

Bayesian Quality Assessment. Here we assume that the algorithm under consideration provides an approximation $\mathbb{P}_{\rm approx}(v|y)$ to the true posterior distribution $\mathbb{P}(v|y)$. We ask the question: how close is $\mathbb{P}_{\rm approx}(v|y)$ to $\mathbb{P}(v|y)$? We might look for a distance measure between probability distributions, or we might simply compare some important moments of the distributions, such as the mean and covariance. Note that this version of quality assessment does not refer to the concept of a true solution $v^\dagger$. We may apply it with $y = y^\dagger$, but we may also apply it when there is model error present and the data comes from outside the model used to perform data assimilation. However, if combined with Bayesian posterior consistency, when $y = y^\dagger$, then the triangle inequality relates the output of the algorithm to the truth $v^\dagger$. Very few practitioners evaluate their algorithms by this measure. This reflects the fact that knowing the true distribution $\mathbb{P}(v|y)$ is often difficult in practical high dimensional problems. However it is arguably the case that practitioners should spend more time querying their algorithms from the perspective of Bayesian quality assessment since the algorithms are often used to make probabilistic statements and forecasts.

Signal Estimation Quality Assessment. Here we assume that the algorithm under consideration provides an approximation to the signal $v$ underlying the data, which we denote by $v_{\rm approx}$; thus $v_{\rm approx}$ attempts to determine and then track the true signal from the data. If the algorithm actually provides a probability distribution, then this estimate might be, for example, the mean. We ask the question: if the algorithm is applied in the situation where the data is generated from the signal $v^\dagger$, how close is $v_{\rm approx}$ to $v^\dagger$? There are two important effects at play here: the first is the information content of the data (does the data actually contain enough information to allow for accurate reconstruction of the signal in principle?); and the second is the role of the specific algorithm used (does the specific algorithm in question have the ability to extract this information when it is present?). This approach thus measures the overall effect of these two in combination.

Forecast Skill. In many cases the goal of data assimilation is to provide better forecasts of the future, for example in numerical weather prediction. In this context data assimilation algorithms can be benchmarked by their ability to make forecasts. This can be discussed in both the Bayesian quality and signal estimation senses. We first discuss Bayesian Estimation forecast skill in the context of stochastic dynamics. The Bayesian $k$-lag forecast skill can be defined by studying the distance between the approximation $\mathbb{P}_{\rm approx}(v|y)$ and $\mathbb{P}(v|y)$ when both are pushed forward from the end-point of the data assimilation window by $k$ applications of the dynamical model (2.1): this model defines a Markov transition kernel which is applied $k$ times to produce a forecast. We now discuss signal estimation forecast skill in the context of deterministic dynamics. Using $v_{\rm approx}$ at the end point of the assimilation window as an initial condition, we run the model (2.3) forward by $k$ steps and compare the output with the truth $v^\dagger$ at the corresponding time. In practical application, this forecast methodology inherently confronts the effect of model error, since the data used to test forecasts is real data which is not generated by the model used to assimilate, as well as information content in the data and algorithm quality.


2.8 Illustrations 


In order to build intuition concerning the probabilistic viewpoint on data assimilation we describe some simple examples where the posterior distribution may be visualized easily. For this reason we concentrate on the case of one-dimensional deterministic dynamics; the posterior pdf $\mathbb{P}(v_0|y)$ for deterministic dynamics is given by Theorem 2.11. It is one-dimensional when the dynamics is one-dimensional and takes place in $\mathbb{R}$. In section 3 we will introduce more sophisticated sampling methods to probe probability distributions in higher dimensions which arise from noisy dynamics and/or from high dimensional models.

Figure 2.10 concerns the scalar linear problem from Example 2.1 (recall that throughout this section we consider only the case of deterministic dynamics) with $\lambda = 0.5$. We employ a prior $N(4, 5)$, we assume that $h(v) = v$, and we set $\Gamma = \gamma^2$ and consider two different values of $\gamma$ and two different values of $J$, the number of observations. The figure shows the posterior distribution in these various parameter regimes. The true value of the initial condition which underlies the data is $v^\dagger_0 = 0.5$. For both $\gamma = 1.0$ and $0.1$ we see that, as the number of observations $J$ increases, the posterior distribution appears to converge to a limiting distribution. However for smaller $\gamma$ the limiting distribution has much smaller variance, and is centred closer to the true initial condition at $0.5$. Both of these observations can be explained, using the fact that the problem is explicitly solvable: we show that for fixed $\gamma$ and $J \to \infty$ the posterior distribution has a limit, which is a Gaussian with non-zero variance. And for fixed $J$ as $\gamma \to 0$ the posterior distribution converges to a Dirac measure (Gaussian with zero variance) centred at the truth $v^\dagger_0$.

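To see these facts we start by noting that from Theorem 2.11 the posterior distribution on $v_0|y$ is proportional to the exponential of

$$I_{\rm det}(v_0;y) = \frac{1}{2\gamma^2}\sum_{j=0}^{J-1}|y_{j+1} - \lambda^{j+1}v_0|^2 + \frac{1}{2\sigma_0^2}|v_0 - m_0|^2$$

where $\sigma_0^2$ denotes the prior variance $C_0$. As a quadratic form in $v_0$ this defines a Gaussian posterior distribution and we may complete the square to find the posterior mean $m$ and variance $\sigma^2_{\rm post}$:

$$\frac{1}{\sigma^2_{\rm post}} = \frac{1}{\gamma^2}\sum_{j=0}^{J-1}\lambda^{2(j+1)} + \frac{1}{\sigma_0^2} = \frac{1}{\gamma^2}\Big(\frac{\lambda^2 - \lambda^{2J+2}}{1 - \lambda^2}\Big) + \frac{1}{\sigma_0^2}$$

and

$$\frac{1}{\sigma^2_{\rm post}}\,m = \frac{1}{\gamma^2}\sum_{j=0}^{J-1}\lambda^{j+1}y_{j+1} + \frac{1}{\sigma_0^2}\,m_0.$$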

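We note immediately that the posterior variance is independent of the data. Furthermore, if we fix $\gamma$ and let $J \to \infty$ then for any $|\lambda| < 1$ we see that the large $J$ limit of the posterior variance is determined by

$$\frac{1}{\sigma^2_{\rm post}} = \frac{1}{\gamma^2}\Big(\frac{\lambda^2}{1 - \lambda^2}\Big) + \frac{1}{\sigma_0^2}$$

and is non-zero; thus uncertainty remains in the posterior, even in the limit of large data. On the other hand, if we fix $J$ and let $\gamma \to 0$ then $\sigma^2_{\rm post} \to 0$ so that uncertainty disappears in this limit. It is then natural to ask what happens to the mean. To this end we assume that the data is itself generated by the linear model of Example 2.1 so that

$$y_{j+1} = \lambda^{j+1}v^\dagger_0 + \gamma\zeta_{j+1}$$

where $\{\zeta_j\}$ is an i.i.d. Gaussian sequence with $\zeta_1 \sim N(0, 1)$. Then

$$\frac{1}{\sigma^2_{\rm post}}\,m = \frac{1}{\gamma^2}\Big(\frac{\lambda^2 - \lambda^{2J+2}}{1 - \lambda^2}\Big)v^\dagger_0 + \frac{1}{\gamma}\sum_{j=0}^{J-1}\lambda^{j+1}\zeta_{j+1} + \frac{1}{\sigma_0^2}\,m_0.$$

Using the formula for $\sigma^2_{\rm post}$ we obtain

$$\Big(\frac{\lambda^2 - \lambda^{2J+2}}{1 - \lambda^2}\Big)m + \frac{\gamma^2}{\sigma_0^2}\,m = \Big(\frac{\lambda^2 - \lambda^{2J+2}}{1 - \lambda^2}\Big)v^\dagger_0 + \gamma\sum_{j=0}^{J-1}\lambda^{j+1}\zeta_{j+1} + \frac{\gamma^2}{\sigma_0^2}\,m_0.$$

From this it follows that, for fixed $J$ and as $\gamma \to 0$, $m \to v^\dagger_0$, almost surely with respect to the noise realization $\{\zeta_j\}$. This is an example of posterior consistency.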
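The explicit solution can be evaluated directly. The following sketch assumes the scalar linear setting of this example (dynamics $v_{j+1} = \lambda v_j$ with $|\lambda| < 1$, identity observation operator, noise standard deviation $\gamma$); the function names and the use of a seeded `random.Random` generator are illustrative choices.

```python
import random


def posterior_mean_var(y, lam, m0, sigma0_sq, gamma):
    """Posterior mean and variance of v0 given y = [y_1, ..., y_J], for |lam| < 1."""
    J = len(y)
    geometric = (lam ** 2 - lam ** (2 * J + 2)) / (1.0 - lam ** 2)   # sum of lam^(2(j+1))
    precision = geometric / gamma ** 2 + 1.0 / sigma0_sq
    weighted = sum(lam ** (j + 1) * y[j] for j in range(J)) / gamma ** 2
    return (weighted + m0 / sigma0_sq) / precision, 1.0 / precision


def synthetic_data(v0_true, lam, gamma, J, rng):
    """Data y_{j+1} = lam^(j+1) * v0_true + gamma * zeta_{j+1}, zeta i.i.d. N(0, 1)."""
    return [lam ** (j + 1) * v0_true + gamma * rng.gauss(0.0, 1.0) for j in range(J)]
```

With $\lambda = 0.5$, $\gamma = 1$ and $\sigma_0^2 = 5$ the variance settles near $1.875$ as $J$ grows, while shrinking $\gamma$ at fixed $J$ drives the mean towards the true initial condition.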



Figure 2.10: Posterior distribution for Example 2.1 for different levels of observational noise: (a) $\gamma = 1$, (b) $\gamma = 0.1$. The true initial condition used in both cases is $v^\dagger_0 = 0.5$, while we have assumed that $C_0 = 5$ and $m_0 = 4$ for the prior distribution.

We now study Example 2.4 in which the true dynamics are no longer linear. We start our investigation taking $r = 2$ and investigate the effect of choosing different prior distributions. Before discussing the properties of the posterior we draw attention to two facts. Firstly, as
Figure [2^ shows, the system converges in a small number of steps to the fixed point at 1/2 
for this value of $r = 2$. And secondly the initial conditions $v_0$ and $1 - v_0$ both result in the same trajectory, if the initial condition is ignored. The first point implies that, after a small number of steps, the observed trajectory contains very little information about the initial condition. The second point means that, since we observe from the first step onwards, only the prior can distinguish between $v_0$ and $1 - v_0$ as the starting point.
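Both facts are easy to check numerically. The sketch below assumes that the quadratic map of Example 2.4 has the logistic form $v \mapsto r v (1 - v)$, which is consistent with the fixed point at $1/2$ for $r = 2$ and with the $v \mapsto 1 - v$ symmetry described here; the function names are illustrative.

```python
def quadratic_map(v, r):
    """One step of the assumed quadratic (logistic) map v -> r * v * (1 - v)."""
    return r * v * (1.0 - v)


def trajectory(v0, r, steps):
    """Return [v_1, ..., v_steps], i.e. the trajectory with the initial condition dropped."""
    out = []
    v = v0
    for _ in range(steps):
        v = quadratic_map(v, r)
        out.append(v)
    return out
```

Starting from $0.1$ and from $0.9$ with $r = 2$ gives trajectories that agree from the first step onwards (up to rounding) and that settle at $1/2$ within a handful of steps, so observations of $v_1, v_2, \ldots$ alone cannot separate the two starting points.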

Figure 2.11 concerns an experiment in which the true initial condition underlying the data is $v^\dagger_0 = 0.1$. Two different priors are used, both with $C_0 = 0.01$, giving a standard deviation
of 0.1, but with different means. The figure illustrates two facts: firstly, even with 10^ 
observations, the posterior contains considerable uncertainty, reflecting the first point above. Secondly the prior mean has an important role in the form of the posterior pdf: shifting the prior mean to the right, from $m_0 = 0.4$ to $m_0 = 0.7$, results in a posterior which favours the initial condition $1 - v^\dagger_0$ rather than the truth $v^\dagger_0$.

Figure 2.11: Posterior distribution for Example 2.4 for $r = 2$ in the case of different means for the prior distribution: (a) $r = 2$, $m_0 = 0.4$, (b) $r = 2$, $m_0 = 0.7$. We have used $C_0 = 0.01$, $\gamma = 0.1$ and true initial condition $v^\dagger_0 = 0.1$, see also p2.m in section 5.1.2.


This behaviour of the posterior changes completely if we assume a flatter prior. This is illustrated in Figure 2.12 where we consider the prior $N(0.4, C_0)$ with $C_0 = 0.5$ and $5$ respectively. As we increase the prior covariance the mean plays a much weaker role than in the preceding experiments: we now obtain a bimodal posterior centred around both the true initial condition $v^\dagger_0$, and also around $1 - v^\dagger_0$.

Figure 2.12: Posterior distribution for Example 2.4 for $r = 2$ in the case of different covariance for the prior distribution. We have used $m_0 = 0.4$, $\gamma = 0.1$ and true initial condition $v^\dagger_0 = 0.1$.

In Figure 2.13 we consider the quadratic map (2.9) with $r = 4$, $J = 5$ and prior $N(0.5, 0.01)$, with observational standard deviation $\gamma = 0.2$. Here, after only five observations the posterior is very peaked, although because of the $v \mapsto 1 - v$ symmetry mentioned above, there are two symmetrically related peaks; see Figure 2.13a. It is instructive to look at the negative of the logarithm of the posterior pdf which, up to an additive constant, is given by $I_{\rm det}(v_0;y)$ in Theorem 2.11. The function $I_{\rm det}(\cdot;y)$ is shown in Figure 2.13b. Its complexity indicates the considerable complications underlying solution of the smoothing problem. We will return to this last point in detail later. Here we simply observe that normalizing the posterior distribution requires evaluation of the integral

$$\int_{\mathbb{R}^n} e^{-I_{\rm det}(v_0;y)}\,dv_0.$$

This integral may often be determined almost entirely by very small subsets of $\mathbb{R}^n$, meaning that this calculation requires some care; indeed if $I_{\rm det}(\cdot;y)$ is very large over much of its domain then it may be impossible to compute the normalization constant numerically. We note, however, that the sampling methods that we will describe in the next chapter do not require evaluation of this integral.
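One reason the naive computation can fail is floating-point underflow: if the negative log-posterior is large everywhere, every exponential evaluates to zero. A common mitigation in one dimension, which is not taken from the text and is offered only as a sketch, is to subtract the minimum of the negative log-density before exponentiating and to work with the logarithm of the normalization constant. The function name and the grid-based Riemann sum are assumptions of this illustration.

```python
import math


def log_normalizer(neg_log_density, grid):
    """Approximate log of the integral of exp(-I(v)) over a uniform one-dimensional grid.

    The minimum of I over the grid is subtracted before exponentiating, so the
    largest term in the sum is exp(0) = 1 and cannot underflow.
    """
    dv = grid[1] - grid[0]
    values = [neg_log_density(v) for v in grid]
    i_min = min(values)
    shifted_sum = sum(math.exp(-(x - i_min)) for x in values)
    return -i_min + math.log(shifted_sum * dv)
```

For instance, with $I(v) = \frac12 v^2 + 1000$ the direct sum of `math.exp(-I(v))` is exactly zero in double precision, whereas the shifted computation returns a finite logarithm. This does not remove the difficulty, noted above, that the integral may be dominated by very small sets which a grid can miss.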



Figure 2.13: Posterior distribution and negative log posterior for Example 2.4 for $r = 4$ and $J = 5$. We have used $C_0 = 0.01$, $m_0 = 0.5$, $\gamma = 0.2$ and true initial condition $v^\dagger_0 = 0.3$.


2.9 Bibliographic Notes 

• Section 2.1 Data Assimilation has its roots in the geophysical sciences, and is driven by the desire to improve inaccurate models of complex dynamically evolving phenomena by means of incorporation of data. The book [66] describes data assimilation from the viewpoint of the atmospheric sciences and weather prediction, whilst the book [10] describes the subject from the viewpoint of oceanography. These two subjects were the initial drivers for evolution of the field. However, other applications are increasingly using the methodology of data assimilation, and the oil industry in particular is heavily involved in the use, and development, of algorithms in this area [93]. The recent book
[T] provides a perspective on the subject from the viewpoint of physics and nonlinear 
dynamical systems, and includes motivational examples from neuroscience, as well as 
the geophysical sciences. The article [60] is a useful one to read because it establishes a 
notation which is now widely used in the applied communities and the articles llolE] 
provide simple introductions to various aspects of the subject from a mathematical 
perspective. The special edition of the journal PhysicaD, devoted to Data Assimilation, 
[6T] . provides an overview of the state of the art around a decade ago. 

• It is useful to comment on generalizations of the set-up described in section 2.1. First we note that we have assumed a Gaussian structure for the additive noise appearing in both the signal model (2.1) and the data model (2.2). This is easily relaxed in much of what we describe here, provided that an explicit formula for the probability density function of the noise is known. However the Kalman filter, described in the next chapter, relies explicitly on the closed Gaussian form of the probability distributions resulting from the assumption of Gaussian noise. There are also other parts of the notes, such as the pCN MCMC methods and the minimization principle underlying approximate Gaussian filters, also both described in the next chapter, which require the Gaussian structure. Secondly we note that we have assumed additive noise. This, too, can be relaxed but has the complication that most non-additive noise models do not yield explicit formulae for the needed conditional probability density functions; for
example this situation arises if one looks at stochastic differential equations over discrete 
time-intervals - see m and the discussion therein. However some of the methods we 
describe rely only on drawing samples from the desired distributions and do not require 
the explicit conditional probability density function. Finally we note that much of what we describe here translates to infinite dimensional spaces with respect to both the signal space, and the data space; however in the infinite dimensional data space case the additive Gaussian observational noise is currently the only situation which is well-developed [106].

• Section 2.2 The subject of deterministic discrete time dynamical systems of the form (2.3) is overviewed in numerous texts; see m and the references therein, and Chapter 1 of [107], for example. The subject of stochastic discrete time dynamical systems of the form (2.1), and in particular the property of ergodicity which underlies Figure |2H is covered in some depth in m. The exact solutions of the quadratic map (2.9) for $r = 2$ and $r = 4$ may be found in [101] and m respectively. The Lorenz '63 model was introduced in m. Not only does this paper demonstrate the possibility of chaotic behaviour and sensitivity with respect to initial conditions, but it also makes a concrete connection between the three dimensional continuous time dynamical system and a one-dimensional chaotic map of the form (2.1). Furthermore, a subsequent computer assisted proof demonstrated rigorously that the ODE does indeed exhibit chaos [midiS]. The book [105] discusses properties of the Lorenz '63 model in some detail and the book [41] discusses properties such as fractal dimension. The shift of origin that we have adopted for the Lorenz '63 model is explained in [109]: it enables the model to be written in an abstract form which includes many geophysical models of interest, such as the Lorenz '96 model introduced in m, and the Navier-Stokes equation on a two-dimensional torus [m 1109]. We now briefly describe this common abstract form. The vector $u \in \mathbb{R}^J$
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(J = 3 for Lorenz ’63, J arbitrary for Lorenz’ 96) solves the equation 


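$$\frac{du}{dt} + Au + B(u,u) = f, \qquad u(0) = u_0, \tag{2.39}$$

where there is $\lambda > 0$ such that, for all $w \in \mathbb{R}^J$, 

$$\langle Aw, w\rangle \ge \lambda |w|^2, \qquad \langle B(w,w), w\rangle = 0.$$


Taking the inner-product with $u$ shows that 

$$\frac{1}{2}\frac{d}{dt}|u|^2 + \lambda |u|^2 \le \langle f, u\rangle.$$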

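If $f$ is constant in time then this inequality may be used to show that (2.16) holds; for 
example, bounding $\langle f, u\rangle \le \frac{1}{2\lambda}|f|^2 + \frac{\lambda}{2}|u|^2$ gives 

$$\frac{1}{2}\frac{d}{dt}|u|^2 \le \frac{1}{2\lambda}|f|^2 - \frac{\lambda}{2}|u|^2.$$

Integrating this inequality gives the existence of an absorbing set and hence leads to 
the existence of a global attractor; see Example 1.22, the book [109] or Chapter 2 of 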
[in3, for example. 

• Section 2.3 contains the formulation of data assimilation as a fully nonlinear and 
non-Gaussian problem in Bayesian statistics. This formulation is not yet the basis 
of practical algorithms in geophysical systems such as weather forecasting. This is 
because global weather forecast models involve n = 0(10®) unknowns, and incorporate 
m = 0(10®) data points daily; sampling the posterior on R" given data in R™ in an 
online fashion, usable for forecasting, is beyond current algorithmic and computational 
capability. However the fully Bayesian perspective provides a fundamental mathematical 
underpinning of the subject, from which other more tractable approaches can be 
systematically derived. See [106] for discussion of the Bayesian approach to inverse 
problems. Historically, data assimilation has not evolved from this Bayesian perspective, 
but has rather evolved out of the control theory perspective. This perspective is 
summarized well in the book [63]. However, the importance of the Bayesian perspective 
is increasingly being recognized in the applied communities. In addition to providing 
a starting point from which to derive approximate algorithms, it also provides a gold 
standard against which other more ad hoc algorithms can be benchmarked; this use 
of Bayesian methodology was suggested in m in the context of meteorology (see the 
discussion that follows), and then employed in [62] in the context of subsurface inverse 
problems arising in geophysics. 

• Section 2.4 describes the filtering, or sequential, approach to data assimilation, within 
the fully Bayesian framework. For low dimensional systems the use of particle filters, 
which may be shown to rigorously approximate the required filtering distribution as it 
evolves in discrete time, has been enormously successful; see [39] for an overview. 
Unfortunately, these filters can behave poorly in high dimensions [13 El 1103] . Whilst there 
is ongoing work to overcome these problems with high-dimensional particle filtering, see 
[141 12511115] for example, this work has yet to impact practical data assimilation in, 
for example, operational weather forecasting. For this reason the ad hoc filters, such as 
3DVAR, the extended Kalman filter and the ensemble Kalman filter, described in Chapter 4, 
are of great practical importance. Their analysis is hence an important challenge for 
applied mathematicians. 
• Section 2.6. Data assimilation may be viewed as an inverse problem to determine the 
signal from the observations. Inverse problems in differential equations are often 
ill-posed when viewed from a classical non-probabilistic perspective. One reason for this 
is that the data may not be informative about the whole signal so that many solutions 
are possible. However, taking the Bayesian viewpoint, in which the many solutions are 
all given a probability, allows for well-posedness to be established. This idea is used 
for data assimilation problems arising in fluid mechanics in [28], for inverse problems 
arising in subsurface geophysics in [36, 34] and described more generally in [106]. 
Well-posedness with respect to changes in the data is of importance in its own right, but also 
more generally because it underpins other stability results which can be used to control 
perturbations. In particular the effect of numerical approximation on integration of the 
forward model can be understood in terms of its effect on the posterior distribution; see 
[29]. A useful overview of probability metrics, including Hellinger and total variation 
metrics, is contained in m- 

• Section 2.7. The subject of posterior consistency is central to the theory of statistics 
in general [113], and within Bayesian statistics in particular [nuniis]. Assessing the 
quality of data assimilation algorithms is typically performed in the “signal estimation” 
framework using identical twin experiments in which the data is generated from 
the same model used to estimate the signal; see m and the references therein. The 
idea of assessing “Bayesian quality” has only recently been used within the data 
assimilation literature; see [H] where this approach is taken for the Navier-Stokes inverse 
problem formulated in [55]. The evaluation of algorithms by means of forecast skill is 
enormously influential in the field of numerical weather prediction and drives a great 
deal of algorithmic selection. The use of information theory to understand the effects of 
model error, and to evaluate filter performance, is introduced in [81] and m 
respectively. There are also a number of useful consistency checks which can be applied 
to evaluate the computational model and its fit to the data gnmi]. We discuss the 
idea of the variant known as rank histograms at the end of Chapter 4. If the empirical 
statistics of the innovations are inconsistent with the assumed model, then they can be 
used to improve the model used in the future; this is known as reanalysis. 


2.10 Exercises 

1. Consider the map given in Example 2.3 and related program p1.m. By experimenting 
with the code determine approximately the value of $\alpha$, denoted by $\alpha_1$, at which the 
noise-free dynamics changes from convergence to an equilibrium point to convergence 
to a period 2 solution. Can you find a value of $\alpha = \alpha_2$ for which you obtain a period 4 
solution? Can you find a value of $\alpha = \alpha_3$ for which you obtain a non-repeating (chaotic) 
solution? For the values $\alpha = 2$, $\alpha_2$ and $\alpha_3$ compare the trajectories of the dynamical 
system obtained with the initial condition 1 and with the initial condition 1.1. Comment 
on what you find. Now fix the initial condition at 1 and consider the same values of 
$\alpha$, with and without noise ($\sigma \in \{0, 1\}$). Comment on what you find. Illustrate your 
answers with graphics. To get interesting displays you will find it helpful to change the 
value of $J$ (number of iterations) depending upon what you are illustrating. 

2. Consider the map given in Example 2.4 and verify the explicit solutions given for $r = 2$ 
and $r = 4$ in formulae (2.10)-(2.12). 

3. Consider the Lorenz ’63 model given in Example 2.6. Determine values of $\{\alpha, \beta\}$ for 
which (2.16) holds. 

4. Consider the Lorenz ’96 model given in Example 2.7. Program p19.m plots solutions 
of the model, as well as studying sensitivity to initial conditions. Study the behaviour 
of the equation for J = 40, F = 2, for J = 40, F = 4 and report your results. Fix F 
at 8 and play with the value of the dimension of the system, J. Report your results. 
Again, illustrate your answers with graphics. 

5. Consider the posterior smoothing distribution from Theorem 2.8. Assume that the 
stochastic dynamics model (2.1) is scalar and defined by $\Psi(v) = av$ for some $a \in \mathbb{R}$ and 
$\Sigma = \sigma^2$; and that the observation model (2.2) is defined by $h(v) = v$ and $\Gamma = \gamma^2$. Find 
explicit formulae for $\mathsf{J}(v)$ and $\Phi(v; y)$, assuming that $v_0 \sim N(m_0, \sigma_0^2)$. 

6. Consider the posterior smoothing distribution from Theorem 2.11. Assume that the 
dynamics model (2.3) is scalar and defined by $\Psi(v) = av$ for some $a \in \mathbb{R}$; and that the 
observation model (2.2) is defined by $h(v) = v$ and $\Gamma = \gamma^2$. Find explicit formulae for 
$\mathsf{J}_{\rm det}(v_0)$ and $\Phi_{\rm det}(v_0; y)$, assuming that $v_0 \sim N(m_0, \sigma_0^2)$. 

7. Consider the definition of total variation distance given in Definition 1.27. State and 
prove a theorem analogous to Theorem 2.15, but employing the total variation distance 
instead of the Hellinger distance. 

8. Consider the filtering distribution from section 2.4 in the case where the stochastic 
dynamics model (2.1) is scalar and defined by $\Psi(v) = av$ for some $a \in \mathbb{R}$ and $\Sigma = \sigma^2$; and 
that the observation model (2.2) is defined by $h(v) = v$ and $\Gamma = \gamma^2$; and $v_0 \sim N(m_0, \sigma_0^2)$. 
Demonstrate that the prediction and analysis steps preserve Gaussianity so that $\mu_j = 
N(m_j, \sigma_j^2)$. Find iterative formulae which update $(m_j, \sigma_j^2)$ to give $(m_{j+1}, \sigma_{j+1}^2)$. 

9. Prove Theorem 2.18. 
Chapter 3 


Discrete Time: Smoothing Algorithms 


The formulation of the data assimilation problem described in the previous chapter is 
probabilistic, and its computational resolution requires the probing of a posterior probability 
distribution on signal given data. This probability distribution is on the signal sequence 
$v = \{v_j\}_{j=0}^{J}$ when the underlying dynamics is stochastic and given by (2.1); the posterior is 
specified in Theorem 2.8 and is proportional to $\exp(-\mathsf{I}(v; y))$, with $\mathsf{I}$ given by (2.21). On the other 
hand, if the underlying dynamics is deterministic and given by (2.3), then the probability 
distribution is on the initial condition $v_0$ only; it is given in Theorem 2.11 and is proportional 
to $\exp(-\mathsf{I}_{\rm det}(v_0; y))$, with $\mathsf{I}_{\rm det}$ given by (2.29). Generically, in this chapter, we refer to the 
unknown variable as $u$, and then use $v$ in the specific case of stochastic dynamics, and $v_0$ in 
the specific case of deterministic dynamics. The aim of this chapter is to understand $\mathbb{P}(u|y)$. 
In this regard we will do three things: 

• find explicit expressions for the pdf $\mathbb{P}(u|y)$ in the linear, Gaussian setting; 

• generate samples from $\mathbb{P}(u|y)$ by algorithms applicable in the non-Gaussian setting; 

• find points where $\mathbb{P}(u|y)$ is maximized with respect to $u$, for given data $y$. 


In general the probability distributions of interest cannot be described by a finite set of 
parameters, except in a few simple situations such as the Gaussian scenario where the mean 
and covariance determine the distribution in its entirety - the Kalman Smoother. When 
the probability distributions cannot be described by a finite set of parameters, an expedient 
computational approach to approximately representing the measure is through the idea of 
Monte Carlo sampling. The basic idea is to approximate a measure $\nu$ by a set of $N$ 
samples $\{u^{(n)}\}_{n=1}^{N}$ drawn, or approximately drawn, from $\nu$ to obtain the measure $\nu^N \approx \nu$ 
given by: 

$$\nu^N = \frac{1}{N}\sum_{n=1}^{N} \delta_{u^{(n)}}. \tag{3.1}$$

We may view this as defining a (random) map on measures which takes $\nu$ into $\nu^N$. If the 
$u^{(n)}$ are exact draws from $\nu$ then the resulting approximation $\nu^N$ converges to the true measure 
$\nu$ as $N \to \infty$.¹ For example if $v = \{v_j\}_{j=0}^{J}$ is governed by the probability distribution $\mu_0$ 
defined by the unconditioned dynamics (2.1), and with pdf determined by (2.19), then exact 
independent samples are easy to generate, simply by running the dynamics 
model forward in discrete time. However for the complex probability measures of interest 
here, where the signal is conditioned on data, exact samples are typically not possible and so 
instead we use the idea of Markov chain Monte Carlo (MCMC) methods which provide 
a methodology for generating approximate samples. These methods do not require knowledge 
of the normalization constant for the measure $\mathbb{P}(u|y)$; as we have discussed, Bayes’ formula 
(1.7) readily delivers $\mathbb{P}(u|y)$ up to normalization, but the normalization itself can be difficult 
to compute. It is also of interest to simply maximize the posterior probability distribution, 
to find a single point estimate of the solution, leading to variational methods, which we 
also consider. 

¹ Indeed we prove such a result in Lemma 4.7 in the context of the particle filter. 
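
As an illustration of the Monte Carlo idea, the following minimal Python sketch draws exact independent samples from unconditioned dynamics and integrates a test function against the resulting empirical measure. It assumes a scalar linear model $v_{j+1} = a v_j + \xi_j$ with Gaussian noise; the function names (`sample_path`, `monte_carlo_mean`) and all parameter values are hypothetical choices made for this example only.

```python
import random

def sample_path(a, sigma, m0, c0, J, rng):
    """Draw one path v_0, ..., v_J from v_{j+1} = a*v_j + xi_j, xi_j ~ N(0, sigma^2)."""
    v = [rng.gauss(m0, c0 ** 0.5)]          # v_0 ~ N(m0, c0)
    for _ in range(J):
        v.append(a * v[-1] + rng.gauss(0.0, sigma))
    return v

def monte_carlo_mean(phi, samples):
    """Integrate phi against the empirical measure (1/N) * sum of Dirac masses."""
    return sum(phi(u) for u in samples) / len(samples)

rng = random.Random(0)
a, sigma, m0, c0, J, N = 0.5, 1.0, 0.0, 1.0, 5, 20000
samples = [sample_path(a, sigma, m0, c0, J, rng) for _ in range(N)]

# Exact second moment of v_J (the mean is zero here), from the
# variance recursion c_{j+1} = a^2 * c_j + sigma^2.
exact = c0
for _ in range(J):
    exact = a * a * exact + sigma * sigma

estimate = monte_carlo_mean(lambda v: v[-1] ** 2, samples)
print(estimate, exact)
```

The estimate fluctuates around the exact value with an error of order $N^{-1/2}$, which is the behaviour one expects from independent samples.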

Section 3.1 gives explicit formulae for the solution of the smoothing problem in the setting 
where the stochastic dynamics model is linear, and subject to Gaussian noise, for which the 
observation operator is linear, and for which the distributions of the initial condition and the 
observational noise are Gaussian; this is the Kalman smoother. These explicit formulae help 
to build intuition about the nature of the smoothing distribution. In section 3.2 we provide 
some background concerning MCMC methods, and in particular, the Metropolis-Hastings 
variant of MCMC, and show how they can be used to explore the posterior distribution. It 
can be very difficult to sample the probability distributions of interest with high accuracy, 
because of the two problems of high dimension and sensitive dependence on initial conditions. 
Whilst we do not claim to introduce the optimal algorithms to deal with these issues, we 
do discuss such issues in relation to the samplers we introduce, and we provide references 
to the active research ongoing in this area. Furthermore, although sampling of the posterior 
distribution may be computationally infeasible in many situations, where possible, it provides 
an important benchmark solution, enabling other algorithms to be compared against a “gold 
standard” Bayesian solution. 

However, because sampling the posterior distribution can be prohibitively expensive, a 
widely used computational methodology is simply to find the point which maximizes the 
probability, using techniques from optimization. These are the variational methods, also 
known as 4DVAR and weak constraint 4DVAR. We introduce this approach to the 
problem in section 3.3. In section 3.4 we provide numerical illustrations which showcase 
the MCMC and variational methods. The chapter concludes in sections 3.5 and 3.6 with 
bibliographic notes and exercises. 

3.1 Linear Gaussian Problems: The Kalman Smoother 

The Kalman smoother plays an important role because it is one of the few examples for 
which the smoothing distribution can be explicitly determined. This explicit characterization 
occurs because the signal dynamics and observation operator are assumed to be linear. When 
combined with the Gaussian assumptions on the initial condition for the signal, and on the 
signal and observational noise, this gives rise to a posterior smoothing distribution which is 
also Gaussian. 

To find formulae for this Gaussian Kalman smoothing distribution we set 

$$\Psi(v) = Mv, \qquad h(v) = Hv \tag{3.2}$$

for matrices $M \in \mathbb{R}^{n \times n}$, $H \in \mathbb{R}^{m \times n}$ and consider the signal/observation model (2.1), (2.2). 

Given data $y = \{y_j\}_{j \in \mathbb{J}}$ and signal $v = \{v_j\}_{j \in \mathbb{J}_0}$ we are interested in the probability 
distribution of $v|y$, as characterized in subsection 2.3.2. By specifying the linear model (3.2), and 
applying Theorem 2.8 we find the following: 
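Theorem 3.1 The posterior smoothing distribution on $v|y$ for the linear stochastic dynamics 
model (2.1), (2.2), (3.2) with $C_0$, $\Sigma$ and $\Gamma$ symmetric positive-definite is a Gaussian probability 
measure $\mu = N(m, C)$ on $\mathbb{R}^{|\mathbb{J}_0| \times n}$. The covariance $C$ is the inverse of a symmetric 
positive-definite block tridiagonal precision matrix 

$$L = \begin{pmatrix}
L_{11} & L_{12} & & & \\
L_{21} & L_{22} & L_{23} & & \\
 & \ddots & \ddots & \ddots & \\
 & & \ddots & \ddots & L_{J,J+1} \\
 & & & L_{J+1,J} & L_{J+1,J+1}
\end{pmatrix}$$

with $L_{ij} \in \mathbb{R}^{n \times n}$ given by $L_{11} = C_0^{-1} + M^T \Sigma^{-1} M$, $L_{jj} = H^T \Gamma^{-1} H + M^T \Sigma^{-1} M + \Sigma^{-1}$ for 
$j = 2, \dots, J$, $L_{J+1,J+1} = H^T \Gamma^{-1} H + \Sigma^{-1}$, $L_{j,j+1} = -M^T \Sigma^{-1}$ and $L_{j+1,j} = -\Sigma^{-1} M$ for 
$j = 1, \dots, J$. Furthermore the mean $m$ solves the equation 

$$Lm = r,$$

where 

$$r_1 = C_0^{-1} m_0, \qquad r_j = H^T \Gamma^{-1} y_{j-1}, \quad j = 2, \dots, J+1.$$

This mean is also the unique minimizer of the functional 

$$\begin{aligned}
\mathsf{I}(v; y) &= \frac{1}{2}\bigl|C_0^{-1/2}(v_0 - m_0)\bigr|^2 + \sum_{j=0}^{J-1} \frac{1}{2}\bigl|\Sigma^{-1/2}(v_{j+1} - M v_j)\bigr|^2 + \sum_{j=0}^{J-1} \frac{1}{2}\bigl|\Gamma^{-1/2}(y_{j+1} - H v_{j+1})\bigr|^2 \\
&= \frac{1}{2}|v_0 - m_0|_{C_0}^2 + \sum_{j=0}^{J-1} \frac{1}{2}|v_{j+1} - M v_j|_{\Sigma}^2 + \sum_{j=0}^{J-1} \frac{1}{2}|y_{j+1} - H v_{j+1}|_{\Gamma}^2
\end{aligned} \tag{3.3}$$

with respect to $v$ and, as such, is a maximizer, with respect to $v$, for the posterior pdf $\mathbb{P}(v|y)$. 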

Proof The proof is based around Lemma 1.6 and identification of the mean and covariance 
by study of an appropriate quadratic form. From Theorem 2.8 we know that the desired 
distribution has pdf proportional to $\exp(-\mathsf{I}(v; y))$, where $\mathsf{I}(v; y)$ is given in (3.3). This is a 
quadratic form in $v$ and we deduce that the inverse covariance $L$ is given by $\partial_v^2 \mathsf{I}(v; y)$, the 
Hessian of $\mathsf{I}$ with respect to $v$. To determine $L$ we note the following identities: 

$$\begin{aligned}
\partial^2_{v_0} \mathsf{I}(v; y) &= C_0^{-1} + M^T \Sigma^{-1} M, \\
\partial^2_{v_j} \mathsf{I}(v; y) &= \Sigma^{-1} + M^T \Sigma^{-1} M + H^T \Gamma^{-1} H, \quad j = 1, \dots, J-1, \\
\partial^2_{v_J} \mathsf{I}(v; y) &= \Sigma^{-1} + H^T \Gamma^{-1} H, \\
\partial^2_{v_j, v_{j+1}} \mathsf{I}(v; y) &= -M^T \Sigma^{-1}, \\
\partial^2_{v_{j+1}, v_j} \mathsf{I}(v; y) &= -\Sigma^{-1} M.
\end{aligned}$$

We may then complete the square and write 

$$\mathsf{I}(v; y) = \frac{1}{2}\langle (v - m), L(v - m)\rangle + q,$$

where $q$ is independent of $v$. From this it follows that the mean does indeed minimize $\mathsf{I}(v; y)$ 
with respect to $v$, and hence maximizes $\mathbb{P}(v|y) \propto \exp(-\mathsf{I}(v; y))$ with respect to $v$. By 
differentiating with respect to $v$ we obtain 

$$Lm = r, \qquad r = -\nabla_v \mathsf{I}(v; y)\big|_{v=0}, \tag{3.4}$$

where $\nabla_v$ is the gradient of $\mathsf{I}$ with respect to $v$. This characterization of $r$ gives the desired 
equation for the mean. Finally we show that $L$, and hence $C$, is positive definite symmetric. 
Clearly $L$ is symmetric and hence so is $C$. It remains to check that $L$ is strictly positive 
definite. To see this note that if we set $m_0 = 0$ then 

$$\frac{1}{2}\langle v, Lv\rangle = \mathsf{I}(v; 0) \ge 0. \tag{3.5}$$

Moreover, $\mathsf{I}(v; 0) = 0$ with $m_0 = 0$ implies, since $C_0 > 0$ and $\Sigma > 0$, 

$$v_0 = 0, \qquad v_{j+1} = M v_j, \quad j = 0, \dots, J-1,$$

i.e. $v = 0$. Hence we have shown that $\langle v, Lv\rangle = 0$ implies $v = 0$ and the proof is complete. □ 
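
To make the structure of Theorem 3.1 concrete, here is a small Python sketch for the scalar case $n = m = 1$ with $M = a$, $H = 1$, $\Sigma = \sigma^2$ and $\Gamma = \gamma^2$. It assembles the tridiagonal precision matrix $L$, solves $Lm = r$ with the standard Thomas algorithm, and evaluates the functional (3.3). The helper names, the data values and the restriction to a scalar state are assumptions made for illustration; they are not taken from the programs accompanying the text.

```python
def smoother_precision(a, sigma2, gamma2, c0, J):
    """Diagonal and off-diagonal of the tridiagonal matrix L (scalar state, H = 1)."""
    diag = [1.0 / c0 + a * a / sigma2]                                  # L_11
    diag += [1.0 / gamma2 + a * a / sigma2 + 1.0 / sigma2] * (J - 1)    # L_jj, j = 2..J
    diag.append(1.0 / gamma2 + 1.0 / sigma2)                            # L_{J+1,J+1}
    off = [-a / sigma2] * J                                             # L_{j,j+1} = L_{j+1,j}
    return diag, off

def solve_tridiagonal(diag, off, rhs):
    """Thomas algorithm for a symmetric tridiagonal linear system."""
    n = len(diag)
    d, b = diag[:], rhs[:]
    for i in range(1, n):                    # forward elimination
        w = off[i - 1] / d[i - 1]
        d[i] -= w * off[i - 1]
        b[i] -= w * b[i - 1]
    x = [0.0] * n
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):           # back substitution
        x[i] = (b[i] - off[i] * x[i + 1]) / d[i]
    return x

def objective(v, y, a, sigma2, gamma2, m0, c0):
    """The functional I(v; y) of (3.3) in the scalar case; y[j] stores y_{j+1}."""
    total = 0.5 * (v[0] - m0) ** 2 / c0
    for j in range(len(y)):
        total += 0.5 * (v[j + 1] - a * v[j]) ** 2 / sigma2
        total += 0.5 * (y[j] - v[j + 1]) ** 2 / gamma2
    return total

a, sigma2, gamma2, m0, c0 = 0.9, 0.5, 0.2, 1.0, 2.0
y = [0.8, 1.1, 0.4, -0.3]                    # illustrative data y_1, ..., y_J
J = len(y)
diag, off = smoother_precision(a, sigma2, gamma2, c0, J)
r = [m0 / c0] + [yj / gamma2 for yj in y]    # r_1 and r_j = y_{j-1} / gamma^2
m = solve_tridiagonal(diag, off, r)          # smoother mean (m_0, ..., m_J)
print(m)
```

Because $L$ is tridiagonal, the solve costs $O(J)$ operations rather than the $O(J^3)$ of a dense solve. One can check numerically that perturbing any component of `m` increases `objective`.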
We now consider the Kalman smoother in the case of deterministic dynamics. Application 
of Theorem 2.11 gives the following: 


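Theorem 3.2 The posterior smoothing distribution on $v_0|y$ for the deterministic linear 
dynamics model (2.3), (2.2), (3.2) with $C_0$ and $\Gamma$ symmetric positive definite is a Gaussian 
probability measure $\nu = N(m_{\rm det}, C_{\rm det})$ on $\mathbb{R}^n$. The covariance $C_{\rm det}$ is the inverse of the 
positive-definite symmetric matrix $L_{\rm det}$ given by the expression 

$$L_{\rm det} = C_0^{-1} + \sum_{j=0}^{J-1} (M^T)^{j+1} H^T \Gamma^{-1} H M^{j+1}.$$

The mean $m_{\rm det}$ solves 

$$L_{\rm det} m_{\rm det} = C_0^{-1} m_0 + \sum_{j=0}^{J-1} (M^T)^{j+1} H^T \Gamma^{-1} y_{j+1}.$$

This mean is a minimizer of the functional 

$$\mathsf{I}_{\rm det}(v_0; y) = \frac{1}{2}|v_0 - m_0|_{C_0}^2 + \sum_{j=0}^{J-1} \frac{1}{2}\bigl|y_{j+1} - H M^{j+1} v_0\bigr|_{\Gamma}^2 \tag{3.7}$$

with respect to $v_0$ and, as such, is a maximizer, with respect to $v_0$, of the posterior pdf $\mathbb{P}(v_0|y)$. 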

Proof By Theorem 2.11 we know that the desired distribution has pdf proportional to 
$\exp(-\mathsf{I}_{\rm det}(v_0; y))$ given by (3.7). The inverse covariance $L_{\rm det}$ can be found as the Hessian of 
$\mathsf{I}_{\rm det}$, $L_{\rm det} = \partial_{v_0}^2 \mathsf{I}_{\rm det}(v_0; y)$, and the mean $m_{\rm det}$ solves 

$$L_{\rm det} m_{\rm det} = -\nabla_{v_0} \mathsf{I}_{\rm det}(v_0; y)\big|_{v_0 = 0}.$$

As in the proof of the preceding theorem, we have that 

$$\mathsf{I}_{\rm det}(v_0; y) = \frac{1}{2}\langle L_{\rm det}(v_0 - m_{\rm det}), (v_0 - m_{\rm det})\rangle + q \tag{3.8}$$

where $q$ is independent of $v_0$; this shows that $m_{\rm det}$ minimizes $\mathsf{I}_{\rm det}(\cdot\,; y)$ and maximizes $\mathbb{P}(\cdot|y)$. 

We have thus characterized $L_{\rm det}$ and $m_{\rm det}$ and using this characterization gives the desired 
expressions. It remains to check that $L_{\rm det}$ is positive definite, since it is clearly symmetric 
by definition. Positive-definiteness follows from the assumed positive-definiteness of $C_0$ and 
$\Gamma$ since, for any nonzero $v_0 \in \mathbb{R}^n$, 

$$\langle v_0, L_{\rm det} v_0\rangle \ge \langle v_0, C_0^{-1} v_0\rangle > 0. \tag{3.9}$$

□ 
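
The formulae of Theorem 3.2 become especially transparent in the scalar case. The sketch below, in Python, assumes $n = m = 1$, $M = a$, $H = 1$ and $\Gamma = \gamma^2$, so that $H M^{j+1}$ reduces to the number $a^{j+1}$. The function names and the data are hypothetical and serve only to illustrate the theorem.

```python
def deterministic_smoother(a, gamma2, m0, c0, y):
    """Scalar instance of Theorem 3.2 (M = a, H = 1); y[j] stores y_{j+1}."""
    L_det = 1.0 / c0                  # C_0^{-1}
    rhs = m0 / c0                     # C_0^{-1} m_0
    for j, yj in enumerate(y):
        g = a ** (j + 1)              # H M^{j+1} in the scalar case
        L_det += g * g / gamma2
        rhs += g * yj / gamma2
    return rhs / L_det, 1.0 / L_det   # posterior mean m_det and variance C_det

def I_det(v0, a, gamma2, m0, c0, y):
    """The functional (3.7) in the scalar case."""
    total = 0.5 * (v0 - m0) ** 2 / c0
    for j, yj in enumerate(y):
        total += 0.5 * (yj - a ** (j + 1) * v0) ** 2 / gamma2
    return total

a, gamma2, m0, c0 = 0.9, 0.2, 1.0, 2.0
y = [0.8, 1.1, 0.4, -0.3]             # illustrative data y_1, ..., y_J
m_det, C_det = deterministic_smoother(a, gamma2, m0, c0, y)
print(m_det, C_det)
```

Each observation adds a non-negative term to $L_{\rm det}$, so the posterior variance `C_det` is never larger than the prior variance `c0`; this is the scalar counterpart of inequality (3.9).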







3.2 Markov Chain-Monte Carlo Methods 


In the case of stochastic dynamics, equation (2.1), the posterior distribution of interest is the 
measure $\mu$ on $\mathbb{R}^{|\mathbb{J}_0| \times n}$, with density $\mathbb{P}(v|y)$ given in Theorem 2.8; in the case of deterministic 
dynamics, equation (2.3), it is the measure $\nu$ on $\mathbb{R}^n$ with density $\mathbb{P}(v_0|y)$ given in Theorem 
2.11. In this section we describe the idea of Markov Chain-Monte Carlo (MCMC) methods 
for exploring such probability distributions. 

We will start by describing the general MCMC methodology, after which we discuss the 
specific Metropolis-Hastings instance of this methodology. This material makes no reference 
to the specific structure of our sampling problem; it works in the general setting of creating 
a Markov chain which is invariant for an arbitrary measure $\mu$ on $\mathbb{R}^\ell$ with pdf $\rho$. We then 
describe applications of the Metropolis-Hastings family of MCMC methods to the smoothing 
problems of noise-free dynamics and noisy dynamics respectively. When describing the generic 
Metropolis-Hastings methodology we will use $u$ (with indices) to denote the state of the 
Markov chain and $w$ (with indices) the proposed moves. Thus the current state $u$ and proposed 
state $w$ live in the space where signal sequences $v$ lie, in the case of stochastic dynamics, and 
in the space where initial conditions $v_0$ lie, in the case of deterministic dynamics. 

3.2.1. The MCMC Methodology 

Recall the concept of a Markov chain $\{u^{(n)}\}_{n \in \mathbb{Z}^+}$ introduced in subsection 1.4.1. The idea 
of MCMC methods is to construct a Markov chain which is invariant with respect to a 
given measure $\mu$ on $\mathbb{R}^\ell$ and, of particular interest to us, a measure $\mu$ with positive Lebesgue 
density $\rho$ on $\mathbb{R}^\ell$. We now use a superscript $n$ to denote the index of the resulting Markov 
chain, instead of subscript $j$, to provide a clear distinction between the Markov chain defined 
by the stochastic (respectively deterministic) dynamics model (2.1) (respectively (2.3)) and 
the Markov chains that we will use to sample the posterior distribution on the signal $v$ 
(respectively initial condition $v_0$) given data $y$. 

We have already seen that Markov chains allow the computation of averages with respect 
to the invariant measure by computing the running time-average of the chain - see (1.16). 
More precisely we have the following theorem (for which it is useful to recall the notation for 
the iterated kernel $p^n$ from the very end of subsection 1.4.1): 

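Theorem 3.3 Assume that, if $u^{(0)} \sim \mu$ with Lebesgue density $\rho$, then $u^{(n)} \sim \mu$ for all $n \in \mathbb{Z}^+$ 
so that $\mu$ is invariant for the Markov chain. If, in addition, the Markov chain is ergodic, then 
for any bounded continuous $\varphi : \mathbb{R}^\ell \to \mathbb{R}$, 

$$\frac{1}{N}\sum_{n=1}^{N} \varphi(u^{(n)}) \to \mathbb{E}^{\mu}\varphi(u) \quad \text{as } N \to \infty$$

for $\mu$ a.e. initial condition $u^{(0)}$. In particular, if there is a probability measure $\mathsf{p}$ on $\mathbb{R}^\ell$ and 
$\varepsilon > 0$ such that, for all $u \in \mathbb{R}^\ell$ and all Borel sets $A \in \mathcal{B}(\mathbb{R}^\ell)$, $p(u, A) \ge \varepsilon\,\mathsf{p}(A)$ then, for all 
$u \in \mathbb{R}^\ell$, 

$$d_{\rm TV}\bigl(p^n(u, \cdot), \mu\bigr) \le 2(1 - \varepsilon)^n. \tag{3.10}$$

Furthermore, there is then $K = K(\varphi) > 0$ such that 

$$\frac{1}{N}\sum_{n=1}^{N} \varphi(u^{(n)}) = \mathbb{E}^{\mu}\varphi(u) + K \xi_N N^{-1/2} \tag{3.11}$$

where $\xi_N$ converges weakly to $N(0, 1)$ as $N \to \infty$. 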
Remark 3.4 This theorem is the backbone behind MCMC. As we will see, there is a large 
class of methods which ensure invariance of a given measure $\mu$ and, furthermore, these 
methods are often provably ergodic so that the preceding theorem applies. As with all algorithms in 
computational science, the optimal algorithm is the one that delivers the smallest error for a given 
unit computational cost. In this regard there are two observations to make about the preceding 
theorem. 

• The constant $K$ measures the size of the variance of the estimator of $\mathbb{E}^{\mu}\varphi(u)$, multiplied 
by $N$. It is thus a surrogate for the error incurred in running MCMC over a finite 
number of steps. The constant $K$ depends on $\varphi$ itself, but will also reflect general 
properties of the Markov chain. For a given MCMC method there will often be tunable 
parameters whose choice will affect the size of $K$, without affecting the cost per step 
of the Markov chain. The objective of choosing these parameters is to minimize the 
constant $K$, within a given class of methods all of which have the same cost per step. In 
thinking about how to do this it is important to appreciate that $K$ measures the amount 
of correlation in the Markov chain; lower correlation leads to decreased constant $K$. 
More precisely, $K$ is computed by integrating the autocorrelation of the Markov chain. 

• A further tension in designing MCMC methods is in the choice of the class of 
methods themselves. Some Markov chains are expensive to implement, but the convergence 
in (3.11) is rapid (the constant $K$ can be made small by appropriate choice of 
parameters), whilst other Markov chains are cheaper to implement, but the convergence in 
(3.11) is slower (the constant $K$ is much larger). Some compromise between ease of 
implementation and rate of convergence needs to be made. 
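
The link between correlation and the size of the error constant can be explored numerically. For a stationary scalar chain with lag-$k$ autocorrelation $r_k$, the variance of the ergodic average, multiplied by $N$, is asymptotically ${\rm Var}(\varphi)\,\tau$ with $\tau = 1 + 2\sum_{k \ge 1} r_k$ (assuming the sum converges). The Python sketch below estimates $\tau$ for a test chain; the AR(1) chain, the truncation lag and the function name are assumptions chosen because $\tau = (1+c)/(1-c)$ is then known in closed form.

```python
import random

def integrated_autocorrelation(chain, max_lag):
    """Estimate tau = 1 + 2 * sum_{k=1}^{max_lag} r_k from a scalar chain."""
    n = len(chain)
    mean = sum(chain) / n
    centred = [x - mean for x in chain]
    var = sum(x * x for x in centred) / n
    tau = 1.0
    for k in range(1, max_lag + 1):
        cov = sum(centred[i] * centred[i + k] for i in range(n - k)) / (n - k)
        tau += 2.0 * cov / var
    return tau

# Test chain: u_{n+1} = c*u_n + sqrt(1 - c^2)*xi_n, which leaves N(0, 1) invariant
# and has lag-k autocorrelation c^k.
rng = random.Random(1)
c, N = 0.5, 50000
u, chain = 0.0, []
for _ in range(N):
    u = c * u + (1.0 - c * c) ** 0.5 * rng.gauss(0.0, 1.0)
    chain.append(u)

tau_hat = integrated_autocorrelation(chain, max_lag=20)
tau_exact = (1.0 + c) / (1.0 - c)
print(tau_hat, tau_exact)
```

Increasing `c` towards 1 makes successive states more correlated and inflates $\tau$, which is the mechanism by which poorly tuned chains lead to a large constant in (3.11).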




3.2.2. Metropolis-Hastings Methods 

The idea of Metropolis-Hastings methods is to build an MCMC method for measure $\mu$, by 
adding an accept/reject test on top of a Markov chain which is easy to implement, but which 
is not invariant with respect to $\mu$; the accept/reject step is designed to enforce invariance 
with respect to $\mu$. This is done by enforcing detailed balance: 

$$\rho(u) p(u, w) = \rho(w) p(w, u). \tag{3.12}$$

Note that integrating with respect to $u$ and using the fact that 

$$\int_{\mathbb{R}^\ell} p(w, u)\,du = 1$$

we obtain 

$$\int_{\mathbb{R}^\ell} \rho(u) p(u, w)\,du = \rho(w)$$

so that (1.18) is satisfied and density $\rho$ is indeed invariant. We now exhibit an algorithm 
designed to satisfy detailed balance, by correcting a given Markov chain, which is not invariant 
with respect to $\mu$, by the addition of an accept/reject mechanism. 
We are given a probability density function $\rho$ hence satisfying $\rho : \mathbb{R}^\ell \to \mathbb{R}^+$, with 
$\int \rho(u)\,du = 1$. Now consider a Markov transition kernel $q : \mathbb{R}^\ell \times \mathbb{R}^\ell \to \mathbb{R}^+$ with the 
property that $\int q(u, w)\,dw = 1$ for every $u \in \mathbb{R}^\ell$. Recall the notation, introduced in subsection 
1.4.1, that we use function $q(u, w)$ to denote a pdf and, simultaneously, a probability measure 
$q(u, dw)$. We create a Markov chain which is invariant for $\rho$ as follows. Define² 

$$a(u, w) = 1 \wedge \frac{\rho(w) q(w, u)}{\rho(u) q(u, w)}. \tag{3.13}$$

The algorithm is: 

1. Set $n = 0$ and choose $u^{(0)} \in \mathbb{R}^\ell$. 

2. $n \to n + 1$. 

3. Draw $w^{(n)} \sim q(u^{(n-1)}, \cdot)$. 

4. Set $u^{(n)} = w^{(n)}$ with probability $a(u^{(n-1)}, w^{(n)})$, $u^{(n)} = u^{(n-1)}$ otherwise. 

5. Go to step 2. 

At each step in the algorithm there are two sources of randomness: that required for 
drawing $w^{(n)}$ in step 3; and that required for accepting or rejecting $w^{(n)}$ as the next $u^{(n)}$ 
in step 4. These two sources of randomness are chosen to be independent of one another. 
Furthermore, all the randomness at discrete algorithmic time $n$ is independent of randomness 
at preceding discrete algorithmic times, conditional on $u^{(n-1)}$. Thus the whole procedure 
gives a Markov chain. If $z = \{z^{(n)}\}_{n \in \mathbb{N}}$ is an i.i.d. sequence of $U[0, 1]$ random variables then 
we may write the algorithm as follows: 

$$\begin{aligned}
w^{(n)} &\sim q(u^{(n-1)}, \cdot), \\
u^{(n)} &= w^{(n)}\,\mathbb{I}\bigl(z^{(n)} \le a(u^{(n-1)}, w^{(n)})\bigr) + u^{(n-1)}\,\mathbb{I}\bigl(z^{(n)} > a(u^{(n-1)}, w^{(n)})\bigr).
\end{aligned}$$

Here $\mathbb{I}$ denotes the indicator function of a set. We let $p : \mathbb{R}^\ell \times \mathbb{R}^\ell \to \mathbb{R}^+$ denote the transition 
kernel of the resulting Markov chain, and we let $p^n$ denote the transition kernel over $n$ steps; 
recall that hence $p^n(u, A) = \mathbb{P}(u^{(n)} \in A \,|\, u^{(0)} = u)$. Similarly as above, for fixed $u$, $p^n(u, dw)$ 
denotes a probability measure on $\mathbb{R}^\ell$ with density $p^n(u, w)$. The resulting algorithm is known 
as a Metropolis-Hastings MCMC algorithm, and satisfies detailed balance with respect to $\rho$. 
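
The five steps of the algorithm can be sketched directly in Python. The version below works with logarithms of the target density and of the proposal density, a common implementation choice that is an assumption here rather than something prescribed by the text. The target (a Gaussian with mean 2 and unit variance, known only up to its normalising constant) and the deliberately non-symmetric proposal are hypothetical test choices.

```python
import math
import random

def metropolis_hastings(log_rho, propose, log_q, u0, n_steps, rng):
    """Generic Metropolis-Hastings chain.

    log_rho(u): log of the target density, up to an additive constant.
    propose(u, rng): draws w ~ q(u, .).
    log_q(u, w): log of q(u, w), up to a constant that does not depend on u or w.
    """
    u, chain, accepted = u0, [u0], 0
    for _ in range(n_steps):
        w = propose(u, rng)                                   # step 3
        log_ratio = (log_rho(w) + log_q(w, u)) - (log_rho(u) + log_q(u, w))
        a = math.exp(min(0.0, log_ratio))                     # a(u, w) = 1 ^ ratio
        if rng.random() < a:                                  # step 4
            u = w
            accepted += 1
        chain.append(u)
    return chain, accepted / n_steps

s = 1.5                                                       # proposal standard deviation

def log_rho(u):
    return -0.5 * (u - 2.0) ** 2                              # N(2, 1), unnormalised

def propose(u, rng):
    return 0.5 * u + rng.gauss(0.0, s)                        # w ~ N(u/2, s^2)

def log_q(u, w):
    return -0.5 * ((w - 0.5 * u) / s) ** 2

rng = random.Random(2)
chain, acc_rate = metropolis_hastings(log_rho, propose, log_q, 0.0, 40000, rng)
mean_estimate = sum(chain) / len(chain)
print(mean_estimate, acc_rate)
```

Since the proposal is not symmetric, the factor $q(w,u)/q(u,w)$ in the acceptance probability matters here; dropping it would in general lead to a chain with the wrong invariant measure.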

Remark 3.5 The following two observations are central to Metropolis-Hastings MCMC 
methods. 

• The construction of Metropolis-Hastings MCMC methods is designed to ensure the 
detailed balance condition (3.12). We will use the condition expressed in this form in 
what follows later. It is also sometimes written in integrated form as the statement 

$$\int_{\mathbb{R}^\ell \times \mathbb{R}^\ell} f(u, w)\rho(u) p(u, w)\,du\,dw = \int_{\mathbb{R}^\ell \times \mathbb{R}^\ell} f(u, w)\rho(w) p(w, u)\,du\,dw \tag{3.14}$$

for all $f : \mathbb{R}^\ell \times \mathbb{R}^\ell \to \mathbb{R}$. Once this condition is obtained it follows trivially that the 
measure $\mu$ with density $\rho$ is invariant since, for $f = f(w)$, we obtain 

$$\begin{aligned}
\int_{\mathbb{R}^\ell} f(w)\Bigl(\int_{\mathbb{R}^\ell} \rho(u) p(u, w)\,du\Bigr) dw &= \int_{\mathbb{R}^\ell} f(w)\rho(w)\Bigl(\int_{\mathbb{R}^\ell} p(w, u)\,du\Bigr) dw \\
&= \int_{\mathbb{R}^\ell} f(w)\rho(w)\,dw.
\end{aligned}$$

Note that $\int_{\mathbb{R}^\ell} \rho(u) p(u, w)\,du$ is the density of the distribution of the Markov chain after 
one step, given that it is initially distributed according to density $\rho$. Thus the preceding 
identity shows that the expectation of $f$ is unchanged by the Markov chain, if it is 
initially distributed with density $\rho$. This means that if the Markov chain is distributed 
according to the measure with density $\rho$ initially then it will be distributed according to the 
same measure for all algorithmic time. 

• Note that, in order to implement Metropolis-Hastings MCMC methods, it is not 
necessary to know the normalisation constant for $\rho(\cdot)$ since only its ratio appears in the 
definition of the acceptance probability $a$. 

² Recall that we use the $\wedge$ operator to denote the minimum between the two real numbers. 
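
Both observations in the remark can be checked exactly on a finite state space, where densities become weight vectors and kernels become row-stochastic matrices. This discrete setting is an assumption made purely for illustration (the text works on $\mathbb{R}^\ell$); the weights and the proposal matrix in the Python sketch below are arbitrary hypothetical values.

```python
def mh_transition_matrix(rho, q):
    """Metropolis-Hastings transition matrix for target weights rho and proposal q."""
    n = len(rho)
    P = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for w in range(n):
            if w != u and q[u][w] > 0.0:
                a = min(1.0, rho[w] * q[w][u] / (rho[u] * q[u][w]))
                P[u][w] = q[u][w] * a          # propose w, then accept
        P[u][u] = 1.0 - sum(P[u])              # rejected proposals stay at u
    return P

rho = [1.0, 2.0, 5.0]                          # target weights, deliberately unnormalised
q = [[0.2, 0.5, 0.3],
     [0.6, 0.1, 0.3],
     [0.3, 0.3, 0.4]]
P = mh_transition_matrix(rho, q)

# Detailed balance: rho[u] * P[u][w] == rho[w] * P[w][u] for every pair (u, w).
balance_gap = max(abs(rho[u] * P[u][w] - rho[w] * P[w][u])
                  for u in range(3) for w in range(3))

# Invariance: sum_u rho[u] * P[u][w] == rho[w] for every w.
invariance_gap = max(abs(sum(rho[u] * P[u][w] for u in range(3)) - rho[w])
                     for w in range(3))
print(balance_gap, invariance_gap)
```

The weights `rho` are never normalised, yet both gaps vanish up to rounding error, which mirrors the second bullet: only ratios of the target enter the acceptance probability.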


The Metropolis-Hastings algorithm defined above satisfies the following, which requires the 
definition of TV distance given in section 1.3. 

Corollary 3.6 For the Metropolis-Hastings MCMC methods we have that the detailed balance 
condition (3.12) is satisfied and that hence $\mu$ is invariant: if $u^{(0)} \sim \mu$ with Lebesgue density 
$\rho$, then $u^{(n)} \sim \mu$ for all $n \in \mathbb{Z}^+$. Thus, if the Markov chain is ergodic, then the conclusions 
of Theorem 3.3 hold. 

We now describe some exemplars of Metropolis-Hastings methods adapted to the data 
assimilation problem. These are not to be taken as optimal MCMC methods for data 
assimilation, but rather as examples of how to construct proposal distributions $q(u, \cdot)$ for 
Metropolis-Hastings methods in the context of data assimilation. In any given application the proposal 
distribution plays a central role in the efficiency of the MCMC method, and tailoring it to 
the specifics of the problem can have a significant impact on that efficiency. 
Because of the level of generality at which we are presenting the material herein (arbitrary $\Psi$ 
and $h$), we cannot discuss such tailoring in any detail. 

3.2.3. Deterministic Dynamics 

In the case of deterministic dynamics (2.3), the measure of interest is a measure on the initial 
condition $v_0$ in $\mathbb{R}^n$. Perhaps the simplest Metropolis-Hastings algorithm is the Random 
Walk Metropolis (RWM) sampler which employs a Gaussian proposal, centred at the 
current state; we now illustrate this for the case of deterministic dynamics. Recall that the 
measure of interest is $\nu$ with pdf $\varrho$. Furthermore $\varrho \propto \exp(-\mathsf{I}_{\rm det}(v_0; y))$ as given in Theorem 
2.11. 

The RWM method proceeds as follows: given that we are at $u^{(n-1)} \in \mathbb{R}^n$, a current 
approximate sample from the posterior distribution on the initial condition, we propose 

$$w^{(n)} = u^{(n-1)} + \beta \iota^{(n-1)} \tag{3.15}$$

where $\iota^{(n-1)} \sim N(0, C_{\rm prop})$ for some symmetric positive-definite proposal covariance $C_{\rm prop}$ 
and proposal variance scale parameter $\beta > 0$; natural choices for this proposal covariance 
include the identity $I$ or the prior covariance $C_0$. Because of the symmetry of such a random 
walk proposal it follows that $q(w, u) = q(u, w)$ and hence that 

$$a(u, w) = 1 \wedge \frac{\varrho(w)}{\varrho(u)} = 1 \wedge \exp\bigl(\mathsf{I}_{\rm det}(u; y) - \mathsf{I}_{\rm det}(w; y)\bigr).$$
Remark 3.7 The expression for the acceptance probability shows that the proposed move to 
$w$ is accepted with probability one if the value of $\mathsf{I}_{\rm det}(\cdot\,; y)$, the negative log-posterior, is decreased by 
moving to $w$ from the current state $u$. On the other hand, if $\mathsf{I}_{\rm det}(\cdot\,; y)$ increases then the 
proposed state is accepted only with some probability less than one. Recall that $\mathsf{I}_{\rm det}(\cdot\,; y)$ is 
the sum of the prior penalization (background) and the model-data misfit functional. The 
algorithm thus has a very natural interpretation in terms of the data assimilation problem: it 
biases samples towards decreasing $\mathsf{I}_{\rm det}(\cdot\,; y)$ and hence to improving the fit to both the model 
and the data in combination. 

The algorithm has two key tuning parameters: the proposal covariance $C_{\rm prop}$ and the scale 
parameter $\beta$. See Remark 3.4, first bullet, for discussion of the role of such parameters. The 
covariance can encode any knowledge, or guesses, about the relative strength of correlations 
in the model; given this, the parameter $\beta$ should be tuned to give an acceptance probability 
that is neither close to 0 nor to 1. This is because if the acceptance probability is small then 
successive states of the Markov chain are highly correlated, leading to a large constant $K$ in 
(3.11). On the other hand if the acceptance probability is close to one then this is typically 
because $\beta$ is small, also leading to highly correlated steps and hence to a large constant $K$ in 
(3.11). 

Numerical results illustrating the method are given in section 3.4. 
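
A minimal Python sketch of the RWM sampler for deterministic dynamics follows. It assumes the scalar linear setting $\Psi(v) = av$, $h(v) = v$, $\Gamma = \gamma^2$, for which the posterior mean is available in closed form from Theorem 3.2 and can be used as a check. The data, the value of $\beta$ and the function names are hypothetical choices for illustration.

```python
import math
import random

def I_det(v0, a, gamma2, m0, c0, y):
    """Prior penalisation plus model-data misfit; y[j] stores y_{j+1}."""
    total = 0.5 * (v0 - m0) ** 2 / c0
    for j, yj in enumerate(y):
        total += 0.5 * (yj - a ** (j + 1) * v0) ** 2 / gamma2
    return total

def rwm(I, u0, beta, c_prop, n_steps, rng):
    """Random Walk Metropolis for a scalar unknown with target proportional to exp(-I)."""
    u, I_u, chain, accepted = u0, I(u0), [], 0
    for _ in range(n_steps):
        w = u + beta * rng.gauss(0.0, math.sqrt(c_prop))   # proposal (3.15)
        I_w = I(w)
        if rng.random() < math.exp(min(0.0, I_u - I_w)):   # 1 ^ exp(I(u) - I(w))
            u, I_u = w, I_w
            accepted += 1
        chain.append(u)
    return chain, accepted / n_steps

a, gamma2, m0, c0 = 0.9, 0.2, 1.0, 2.0
y = [0.8, 1.1, 0.4, -0.3]                                  # illustrative data
rng = random.Random(3)
chain, acc_rate = rwm(lambda v: I_det(v, a, gamma2, m0, c0, y),
                      u0=m0, beta=0.7, c_prop=1.0, n_steps=20000, rng=rng)
burn_in = 1000
rwm_mean = sum(chain[burn_in:]) / len(chain[burn_in:])

# Closed-form posterior mean from Theorem 3.2 in the scalar case.
L_det = 1.0 / c0 + sum(a ** (2 * (j + 1)) for j in range(len(y))) / gamma2
exact_mean = (m0 / c0 + sum(a ** (j + 1) * yj for j, yj in enumerate(y)) / gamma2) / L_det
print(rwm_mean, exact_mean, acc_rate)
```

Re-running with a much smaller or much larger `beta` moves the acceptance rate towards one or zero respectively, and in both cases the chain mixes more slowly, as described in the remark.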


3.2.4. Stochastic Dynamics 

We now apply the Metropolis-Hastings methodology to the data assimilation smoothing problem
in the case of the stochastic dynamics model (2.1). Thus the probability measure is on
an entire signal sequence $\{v_j\}_{j=0}^{J}$ and not just on $v_0$; hence it lives on
$\mathbb{R}^{|\mathbb{J}_0| \times n}$. It is possible to apply the random walk method to this situation, too,
but we take the opportunity to introduce several different Metropolis-Hastings methods, in order to
highlight the flexibility of the methodology. Furthermore, it is also possible to take the ideas
behind the proposals introduced in this section and apply them in the case of deterministic dynamics.

In what follows recall the measures $\mu_0$ and $\mu$ defined in section 2.3, with densities $\rho_0$
and $\rho$, representing (respectively) the measure on sequences $v$ generated by (2.1) and the
resulting measure when the signal is conditioned on the data $y$ from (2.2). We now construct,
via the Metropolis-Hastings methodology, two Markov chains which are invariant
with respect to $\mu$. Hence we need only specify the transition kernel $q(u,w)$, and identify the
resulting acceptance probability $a(u,w)$. The sequence $\{w^{(n)}\}_{n \in \mathbb{Z}^+}$ will denote the proposals.

Independence Dynamics Sampler. Here we choose the proposal $w^{(n)}$ independently
of the current state $v^{(n-1)}$, from the prior $\mu_0$ with density $\rho_0$. Thus we are simply
proposing independent draws from the dynamical model (2.1), with no information from the data used
in the proposal. Important in what follows is the observation that

$$\frac{\rho(v)}{\rho_0(v)} \propto \exp\bigl(-\Phi(v;y)\bigr). \tag{3.16}$$

With the given definition of proposal we have that $q(u,w) = \rho_0(w)$ and hence that

$$a(u,w) = 1 \wedge \frac{\rho(w)\,q(w,u)}{\rho(u)\,q(u,w)}
        = 1 \wedge \frac{\rho(w)/\rho_0(w)}{\rho(u)/\rho_0(u)}
        = 1 \wedge \exp\bigl(\Phi(u;y) - \Phi(w;y)\bigr).$$

Remark 3.8 The expression for the acceptance probability shows that the proposed move to
$w$ is accepted with probability one if the value of $\Phi(\cdot;y)$ is decreased by moving to $w$ from the
current state $u$. On the other hand, if $\Phi(\cdot;y)$ increases then the proposed state is accepted
only with some probability less than one, with the probability decreasing exponentially fast with
respect to the size of the increase. Recall that $\Phi(\cdot;y)$ measures the fit of the signal to the data.
Because the proposal builds in the underlying signal model the acceptance probability does not
depend on $I(\cdot;y)$, the negative log-posterior, but only the part reflecting the data, namely the
negative log-likelihood. In contrast the RWM method, explained in the context of deterministic
dynamics, does not build the model into its proposal and hence the accept-reject mechanism
depends on the entire log-posterior; see Remark 3.7.
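As a hedged illustration of the Independence Dynamics Sampler, the sketch below proposes whole paths from the prior dynamics and accepts them using only the misfit $\Phi$. It assumes a scalar model $v_{j+1} = \alpha \sin(v_j) + \xi_j$ with direct observations $y_{j+1} = v_{j+1} + \eta_{j+1}$; this model, the parameter values and the function names are assumptions made for the example, not taken from the text.

```python
import math
import random


def draw_prior_path(alpha, sigma, m0, C0, J, rng):
    """Sample (v_0, ..., v_J): v_0 ~ N(m0, C0), v_{j+1} = alpha*sin(v_j) + sigma*N(0,1)."""
    v = [m0 + math.sqrt(C0) * rng.gauss(0.0, 1.0)]
    for _ in range(J):
        v.append(alpha * math.sin(v[-1]) + sigma * rng.gauss(0.0, 1.0))
    return v


def Phi(v, y, gamma):
    """Model-data misfit for assumed observations y_{j+1} = v_{j+1} + gamma*N(0,1)."""
    return sum(0.5 * ((yj - vj) / gamma) ** 2 for yj, vj in zip(y, v[1:]))


def independence_sampler(y, alpha, sigma, gamma, m0, C0, n_steps, rng):
    """Return (final state, acceptance rate) of the independence dynamics sampler."""
    J = len(y)
    u = draw_prior_path(alpha, sigma, m0, C0, J, rng)
    Phi_u = Phi(u, y, gamma)
    accepted = 0
    for _ in range(n_steps):
        # The proposal ignores both the current state u and the data y.
        w = draw_prior_path(alpha, sigma, m0, C0, J, rng)
        Phi_w = Phi(w, y, gamma)
        # Accept with probability min(1, exp(Phi(u; y) - Phi(w; y))).
        if Phi_w <= Phi_u or rng.random() < math.exp(Phi_u - Phi_w):
            u, Phi_u = w, Phi_w
            accepted += 1
    return u, accepted / n_steps


rng = random.Random(1)
alpha, sigma, m0, C0, J = 2.5, 1.0, 0.0, 1.0, 10
truth = draw_prior_path(alpha, sigma, m0, C0, J, rng)
for gamma in (2.0, 1.0, 0.5):
    y = [vj + gamma * rng.gauss(0.0, 1.0) for vj in truth[1:]]
    _, rate = independence_sampler(y, alpha, sigma, gamma, m0, C0, 5000, rng)
    print(f"gamma = {gamma}: acceptance rate = {rate:.3f}")
```

Note that the prior density never appears in the accept-reject step: it cancels because the proposal is itself a draw from the prior, exactly as in (3.16).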


The Independence Dynamics Sampler has no tuning parameters, so nothing can be adjusted
in order to obtain a reasonable acceptance probability; as we will see in the illustrations of
section 3.4 below, the method can be quite inefficient because of the resulting frequent rejections.
We now discuss this point and an approach to resolve it. The rejections are caused by attempts to
move far from the current state, and in particular to proposed states which are based on the
underlying stochastic dynamics, but not on the observed data. This typically leads to increases in
the model-data misfit functional $\Phi(\cdot;y)$ once the Markov chain has found a state which fits the
data reasonably well. Even if data is not explicitly used in constructing the proposal, this
effect can be ameliorated by making local proposals, which do not move far from the current
state. These are exemplified in the following MCMC algorithm.

The pCN Method. It is helpful in what follows to recall the measure $\vartheta_0$ with density
$\pi_0$ found from $\mu_0$ and $\rho_0$ in the case where $\Psi \equiv 0$ and given by equation (2.24). We denote
the mean by $m$ and covariance by $C$, noting that $m = (m_0^T, 0^T, \ldots, 0^T)^T$ and that $C$ is
block diagonal with first block $C_0$ and the remainder all being $\Sigma$. Thus $\vartheta_0 = N(m, C)$. The
basic idea of this method is to make proposals with the property that, if $\Psi \equiv 0$ so that
the dynamics is Gaussian and with no time correlation, and if $h \equiv 0$ so that the data is
totally uninformative, then the proposal would be accepted with probability one. Making
small incremental proposals of this type then leads to a Markov chain which incorporates the
effects of $\Psi \neq 0$ and $h \neq 0$ through the accept-reject mechanism. We describe the details of
how this works.

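Recall the prior on the stochastic dynamics model with density $\rho_0(v) \propto \exp(-J(v))$ given
by (2.19). It will be useful to rewrite $\pi_0$ as follows:

$$\pi_0(v) \propto \exp\bigl(-J(v) + F(v)\bigr),$$

where

$$F(v) = \sum_{j=0}^{J-1} \Bigl( \tfrac12 \bigl|\Sigma^{-1/2}\Psi(v_j)\bigr|^2
        - \bigl\langle \Sigma^{-1/2} v_{j+1}, \Sigma^{-1/2}\Psi(v_j) \bigr\rangle \Bigr). \tag{3.17}$$

We note that

$$\frac{\rho_0(v)}{\pi_0(v)} \propto \exp\bigl(-F(v)\bigr)$$

and hence that, using (2.22),

$$\frac{\rho(v)}{\pi_0(v)} \propto \exp\bigl(-\Phi(v;y) - F(v)\bigr). \tag{3.18}$$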

Recall the Gaussian measure $\vartheta_0 = N(m, C)$ defined via its pdf in (2.24). The pCN method
is a variant of random walk type methods, based on the following Gaussian proposal

$$w^{(n)} = m + (1 - \beta^2)^{1/2}\bigl(v^{(n-1)} - m\bigr) + \beta\,\iota^{(n-1)}, \qquad
  \beta \in (0,1], \quad \iota^{(n-1)} \sim N(0, C). \tag{3.19}$$

Here $\iota^{(n-1)}$ is assumed to be independent of $v^{(n-1)}$.

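Lemma 3.9 Consider the Markov chain

$$v^{(n)} = m + (1 - \beta^2)^{1/2}\bigl(v^{(n-1)} - m\bigr) + \beta\,\iota^{(n-1)}, \qquad
  \beta \in (0,1], \quad \iota^{(n-1)} \sim N(0, C), \tag{3.20}$$

with $\iota^{(n-1)}$ independent of $v^{(n-1)}$. The Markov kernel for this chain $q(u,w)$ satisfies detailed
balance (3.12) with respect to the measure $\vartheta_0$ with density $\pi_0$:

$$\frac{\pi_0(w)\,q(w,u)}{\pi_0(u)\,q(u,w)} = 1. \tag{3.21}$$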


Proof We show that $\pi_0(u)\,q(u,w)$ is symmetric in $(u,w)$. To demonstrate this it suffices to
consider the quadratic form found by taking the negative of the logarithm of this expression.
This is given by

$$\frac12 |u - m|_C^2 + \frac{1}{2\beta^2} \bigl| w - m - (1 - \beta^2)^{1/2}(u - m) \bigr|_C^2.$$

This is the same as

$$\frac{1}{2\beta^2} |u - m|_C^2 + \frac{1}{2\beta^2} |w - m|_C^2
  - \frac{(1 - \beta^2)^{1/2}}{\beta^2} \langle w - m, u - m \rangle_C,$$

which is clearly symmetric in $(u,w)$. The result follows. □

By use of (3.21) and (3.18) we deduce that the acceptance probability for the MCMC
method with proposal (3.19) is

$$a(u,w) = 1 \wedge \frac{\rho(w)\,q(w,u)}{\rho(u)\,q(u,w)}
        = 1 \wedge \frac{\rho(w)/\pi_0(w)}{\rho(u)/\pi_0(u)}
        = 1 \wedge \exp\bigl(\Phi(u;y) - \Phi(w;y) + F(u) - F(w)\bigr).$$

Recall that the proposal preserves the underlying Gaussian structure of the stochastic dynamics
model; the accept-reject mechanism then introduces non-Gaussianity into the stochastic
dynamics model, via $F$, and introduces the effect of the data, via $\Phi$. By choosing $\beta$ small, so
that $w^{(n)}$ is close to $v^{(n-1)}$, we can make $a(v^{(n-1)}, w^{(n)})$ reasonably large and obtain a usable
algorithm. This is illustrated in section 3.4.
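The pCN proposal and its acceptance probability can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the model is scalar with $\Psi(v) = \alpha \sin(v)$, observations are direct, so that $C = {\rm diag}(C_0, \sigma^2, \ldots, \sigma^2)$, and all function names and parameter values are hypothetical.

```python
import math
import random


def Phi(v, y, gamma):
    """Model-data misfit for assumed observations y_{j+1} = v_{j+1} + gamma*N(0,1)."""
    return sum(0.5 * ((yj - vj) / gamma) ** 2 for yj, vj in zip(y, v[1:]))


def F(v, alpha, sigma):
    """F(v) = sum_j (Psi(v_j)^2 / 2 - v_{j+1} * Psi(v_j)) / sigma^2, Psi = alpha*sin."""
    total = 0.0
    for vj, vnext in zip(v[:-1], v[1:]):
        psi = alpha * math.sin(vj)
        total += (0.5 * psi ** 2 - vnext * psi) / sigma ** 2
    return total


def pcn_proposal(v, m0, C0, sigma, beta, rng):
    """w = m + sqrt(1 - beta^2) * (v - m) + beta * iota, with iota ~ N(0, C)."""
    s = math.sqrt(1.0 - beta ** 2)
    m = [m0] + [0.0] * (len(v) - 1)
    std = [math.sqrt(C0)] + [sigma] * (len(v) - 1)   # C = diag(C0, sigma^2, ...)
    return [mi + s * (vi - mi) + beta * sd * rng.gauss(0.0, 1.0)
            for vi, mi, sd in zip(v, m, std)]


def pcn(y, alpha, sigma, gamma, m0, C0, beta, n_steps, rng):
    """Return (final state, acceptance rate) of the pCN chain started at the mean m."""
    v = [m0] + [0.0] * len(y)
    E_v = Phi(v, y, gamma) + F(v, alpha, sigma)
    accepted = 0
    for _ in range(n_steps):
        w = pcn_proposal(v, m0, C0, sigma, beta, rng)
        E_w = Phi(w, y, gamma) + F(w, alpha, sigma)
        # Accept with probability min(1, exp(Phi(v) - Phi(w) + F(v) - F(w))).
        if E_w <= E_v or rng.random() < math.exp(E_v - E_w):
            v, E_v = w, E_w
            accepted += 1
    return v, accepted / n_steps


# Synthetic truth and data from the assumed model (illustrative values).
rng = random.Random(2)
alpha, sigma, gamma, m0, C0, J = 2.5, 1.0, 1.0, 0.0, 1.0, 10
truth = [m0 + math.sqrt(C0) * rng.gauss(0.0, 1.0)]
for _ in range(J):
    truth.append(alpha * math.sin(truth[-1]) + sigma * rng.gauss(0.0, 1.0))
y = [vj + gamma * rng.gauss(0.0, 1.0) for vj in truth[1:]]
for beta in (0.05, 0.5, 1.0):
    _, rate = pcn(y, alpha, sigma, gamma, m0, C0, beta, 5000, rng)
    print(f"beta = {beta}: acceptance rate = {rate:.3f}")
```

With $\beta = 1$ the proposal is an independent draw from $N(m, C)$ and ignores the current state, whereas small $\beta$ gives local moves; the loop prints the resulting acceptance rates for comparison.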

Recall from subsection 2.3.3 that, if $\Psi \equiv 0$ (as assumed to define the measure $\vartheta_0$), then
the noise sequence is identical with the signal sequence. More generally, even if $\Psi \neq 0$, the
noise sequence together with $v_0$, a vector which we denote in subsection 2.3.3 by $\xi$, uniquely
determines the signal sequence: see Lemma 2.9. This
motivates a different formulation of the smoothing problem for stochastic dynamics where
one views the noise sequence and initial condition as the unknown, rather than the signal
sequence itself. Here we study the implication of this perspective for MCMC methodology,
in the context of the pCN Method, leading to our third sampler within this subsection: the
pCN Dynamics Sampler. We now describe this algorithm.

The pCN Dynamics Sampler is so-named because the proposal (implicitly, via the
mapping $G$ defined in Lemma 2.9) samples from the dynamics as in the Independence Sampler,
while the proposal also includes a parameter $\beta$ allowing small steps to be taken and chosen to
ensure good acceptance probability, as in the pCN Method. The posterior measure we wish
to sample is given in Theorem 2.10. Note that this theorem implicitly contains the fact that

$$\vartheta(d\xi) \propto \exp\bigl(-\Phi_r(\xi;y)\bigr)\,\vartheta_0(d\xi).$$

Furthermore $\vartheta_0 = N(m, C)$ where the mean $m$ and covariance $C$ are as described above for
the standard pCN method. We use the pCN proposal (3.19):

$$\zeta^{(n)} = m + (1 - \beta^2)^{1/2}\bigl(\xi^{(n-1)} - m\bigr) + \beta\,\iota^{(n-1)},$$

and the acceptance probability is given by

$$a(\xi, \zeta) = 1 \wedge \exp\bigl(\Phi_r(\xi;y) - \Phi_r(\zeta;y)\bigr).$$

When interpreting this formula it is instructive to note that

$$\Phi_r(\xi;y) = \frac12 \bigl|y - G(\xi)\bigr|_{\Gamma_J}^2 = \frac12 \bigl|\Gamma_J^{-1/2}\bigl(y - G(\xi)\bigr)\bigr|^2,$$

and that $\xi$ comprises both $v_0$ and the noise sequence $\{\xi_j\}_{j=0}^{J-1}$. Thus the method has the same
acceptance probability as the Independence Dynamics Sampler, albeit expressed in terms of
initial condition and model noise rather than signal, and also possesses a tunable parameter
$\beta$; it thus has the nice conceptual interpretation of the acceptance probability that is present
in the Independence Dynamics Sampler, as well as the advantage of the pCN method that
the proposal variance scale parameter $\beta$ may be chosen to ensure a reasonable acceptance probability.
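The following sketch illustrates the pCN Dynamics Sampler on the space of initial condition and noise. As before it is only an illustration under assumptions: a scalar model $v_{j+1} = \alpha\sin(v_j) + \xi_j$ with direct observations, and hypothetical function names. The map `G` plays the role of the mapping from $\xi$ to the signal.

```python
import math
import random


def G(xi, alpha):
    """Map xi = (v_0, xi_0, ..., xi_{J-1}) to the signal (v_0, ..., v_J)."""
    v = [xi[0]]
    for noise in xi[1:]:
        v.append(alpha * math.sin(v[-1]) + noise)
    return v


def Phi_r(xi, y, alpha, gamma):
    """Misfit of the signal G(xi) to assumed direct observations of v_1, ..., v_J."""
    v = G(xi, alpha)
    return sum(0.5 * ((yj - vj) / gamma) ** 2 for yj, vj in zip(y, v[1:]))


def pcn_dynamics(y, alpha, sigma, gamma, m0, C0, beta, n_steps, rng):
    """Return (final signal, acceptance rate); the chain lives on xi, not on v."""
    J = len(y)
    m = [m0] + [0.0] * J
    std = [math.sqrt(C0)] + [sigma] * J          # C = diag(C0, sigma^2, ...)
    s = math.sqrt(1.0 - beta ** 2)
    xi = list(m)
    P_xi = Phi_r(xi, y, alpha, gamma)
    accepted = 0
    for _ in range(n_steps):
        zeta = [mi + s * (x - mi) + beta * sd * rng.gauss(0.0, 1.0)
                for x, mi, sd in zip(xi, m, std)]
        P_zeta = Phi_r(zeta, y, alpha, gamma)
        # Accept with probability min(1, exp(Phi_r(xi; y) - Phi_r(zeta; y))).
        if P_zeta <= P_xi or rng.random() < math.exp(P_xi - P_zeta):
            xi, P_xi = zeta, P_zeta
            accepted += 1
    return G(xi, alpha), accepted / n_steps


# Synthetic data from the assumed model (illustrative values).
rng = random.Random(4)
alpha, sigma, gamma, m0, C0, J = 2.5, 1.0, 1.0, 0.0, 1.0, 10
xi_true = [m0 + math.sqrt(C0) * rng.gauss(0.0, 1.0)]
xi_true += [sigma * rng.gauss(0.0, 1.0) for _ in range(J)]
y = [vj + gamma * rng.gauss(0.0, 1.0) for vj in G(xi_true, alpha)[1:]]
for beta in (0.05, 0.5, 1.0):
    _, rate = pcn_dynamics(y, alpha, sigma, gamma, m0, C0, beta, 5000, rng)
    print(f"beta = {beta}: acceptance rate = {rate:.3f}")
```

No $F$ term appears here: the dynamics enter only through `G`, so the accept-reject step depends on the model-data misfit alone.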


3.3 Variational Methods 

Sampling the posterior using MCMC methods can be prohibitively expensive. This is because, 
in general, sampling involves generating many different points in the state space of the Markov 
chain. It can be of interest to generate a single point, or small number of points, which 
represent the salient features of the probability distribution, when this is possible. If the 
probability is peaked at one, or a small number of places, then simply locating these peaks 
may be sufficient in some applied contexts. This is the basis for variational methods which 
seek to maximize the posterior probability, thereby locating such peaks. In practice this boils 
down to minimizing the negative log-posterior. 

We start by illustrating the idea in the context of the Gaussian distributions highlighted
in section 3.1 concerning the Kalman smoother. In the case of stochastic dynamics, Theorem
3.1 shows that $\mathbb{P}(v|y)$, the pdf of the posterior distribution, has the form

$$\mathbb{P}(v|y) \propto \exp\bigl(-\tfrac12 |v - m|_L^2\bigr).$$

Now consider the problem

$$v^* = {\rm argmax}_{v \in \mathbb{R}^{|\mathbb{J}_0| \times n}}\, \mathbb{P}(v|y).$$

From the structure of $\mathbb{P}(v|y)$ we see that

$$v^* = {\rm argmin}_{v \in \mathbb{R}^{|\mathbb{J}_0| \times n}}\, I(v;y)$$

where

$$I(v;y) = \tfrac12 |v - m|_L^2.$$

Thus $v^* = m$, the mean of the posterior. Similarly, using Theorem 3.2 we can show that, in
the case of deterministic dynamics,

$$v_0^* = {\rm argmax}_{v_0 \in \mathbb{R}^{n}}\, \mathbb{P}(v_0|y)$$

is given by $v_0^* = m_{\rm det}$.

In this section we show how to characterize peaks in the posterior probability, in the
general non-Gaussian case, leading to problems in the calculus of variations. The methods
are termed variational methods. In the atmospheric sciences these variational methods are
referred to as 4DVAR; this nomenclature reflects the fact that they are variational methods
which incorporate data over three spatial dimensions and one temporal dimension (thus four
dimensions in total), in order to estimate the state. In Bayesian statistics the methods are
called MAP estimators: maximum a posteriori estimators. It is helpful to realize that the
MAP estimator is not, in general, equal to the mean of the posterior distribution. However,
in the case of Gaussian posteriors, it is equal to the mean. Computation of the mean of a
posterior distribution, in general, requires integrating against the posterior distribution. This
can be achieved, via sampling for example, but is typically quite expensive, if sampling is
expensive. MAP estimators, in contrast, only require solution of an optimization problem.
Unlike the previous section on MCMC methods we do not attempt to overview the vast
literature on relevant algorithms (here optimization algorithms); instead references are given
in the bibliographic notes of section 3.5.

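First we consider the case of stochastic dynamics.

Theorem 3.10 Consider the data assimilation problem for stochastic dynamics: (2.1),
(2.2), with $\Psi \in C^1(\mathbb{R}^n, \mathbb{R}^n)$ and $h \in C^1(\mathbb{R}^n, \mathbb{R}^m)$. Then:

• (i) the infimum of $I(\cdot;y)$ given in (2.21) is attained at at least one point $v^*$ in
$\mathbb{R}^{|\mathbb{J}_0| \times n}$. It follows that the density $\rho(v) = \mathbb{P}(v|y)$ on $\mathbb{R}^{|\mathbb{J}_0| \times n}$ associated with
the posterior probability $\mu$ given by Theorem 2.8 is maximized at $v^*$;

• (ii) furthermore, let $B(u,\delta)$ denote a ball in $\mathbb{R}^{|\mathbb{J}_0| \times n}$ of radius $\delta$ and centred at $u$. Then

$$\lim_{\delta \to 0} \frac{\mathbb{P}^{\mu}(B(u_1,\delta))}{\mathbb{P}^{\mu}(B(u_2,\delta))}
  = \exp\bigl(I(u_2;y) - I(u_1;y)\bigr) \quad \text{for all } u_1, u_2 \in \mathbb{R}^{|\mathbb{J}_0| \times n}. \tag{3.22}$$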

Proof Note that $I(\cdot;y)$ is non-negative and continuous so that the infimum $\bar{I}$ is finite and
non-negative. To show that the infimum of $I(\cdot;y)$ is attained in $\mathbb{R}^{|\mathbb{J}_0| \times n}$ we let $v^{(n)}$ denote a
minimizing sequence. Without loss of generality we may assume that, for all $n \in \mathbb{N}$,

$$I(v^{(n)};y) \le \bar{I} + 1.$$

From the structure of $I(\cdot;y)$ it follows that

$$v_0 = m_0 + C_0^{1/2} r_0,$$
$$v_{j+1} = \Psi(v_j) + \Sigma^{1/2} r_{j+1}, \quad j \in \mathbb{Z}^+,$$

where $\tfrac12 |r_j|^2 \le \bar{I} + 1$ for all $j \in \mathbb{Z}^+$. By iterating and using the inequalities on the $|r_j|$ we
deduce the existence of $R > 0$ such that $|v^{(n)}| \le R$ for all $n \in \mathbb{N}$. From this bounded sequence
we may extract a convergent subsequence, relabelled to $v^{(n)}$ for simplicity, with limit $v^*$. By
construction we have that $v^{(n)} \to v^*$ and, for any $\epsilon > 0$, there is $N = N(\epsilon)$ such that

$$\bar{I} \le I(v^{(n)};y) \le \bar{I} + \epsilon, \quad \forall n \ge N.$$

Hence, by continuity of $I(\cdot;y)$, it follows that

$$\bar{I} \le I(v^*;y) \le \bar{I} + \epsilon.$$

Since $\epsilon > 0$ is arbitrary it follows that $I(v^*;y) = \bar{I}$. Because

$$\mu(dv) = \frac{1}{Z} \exp\bigl(-I(v;y)\bigr)\,dv = \rho(v)\,dv$$

it follows that $v^*$ also maximizes the posterior pdf $\rho$.

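For the final result we first note that, because $\Psi$ and $h$ are continuously differentiable, the
function $I(\cdot;y)$ is continuously differentiable. Thus we have

$$\mathbb{P}^{\mu}(B(u,\delta)) = \frac{1}{Z} \int_{|v-u|<\delta} \exp\bigl(-I(v;y)\bigr)\,dv
  = \frac{1}{Z} \int_{|v-u|<\delta} \exp\bigl(-I(u;y) + e(u; v-u)\bigr)\,dv$$

where

$$e(u; v-u) = \Bigl\langle -\int_0^1 D_v I\bigl(u + s(v-u); y\bigr)\,ds,\; v - u \Bigr\rangle.$$

As a consequence we have, for $K^{\pm} > 0$,

$$-K^{-}|\delta| \le e(u; v-u) \le K^{+}|\delta|$$

for $u = u_1, u_2$ and $|v - u| < \delta$. Using the preceding we find that, for $E := \exp\bigl(I(u_2;y) - I(u_1;y)\bigr)$,

$$\frac{\mathbb{P}^{\mu}(B(u_1,\delta))}{\mathbb{P}^{\mu}(B(u_2,\delta))}
  \le E\, \frac{\int_{|v-u_1|<\delta} \exp(K^{+}|\delta|)\,dv}{\int_{|v-u_2|<\delta} \exp(-K^{-}|\delta|)\,dv}
  = E\, \frac{\exp(K^{+}|\delta|)}{\exp(-K^{-}|\delta|)}.$$

Similarly we have that

$$\frac{\mathbb{P}^{\mu}(B(u_1,\delta))}{\mathbb{P}^{\mu}(B(u_2,\delta))}
  \ge E\, \frac{\int_{|v-u_1|<\delta} \exp(-K^{-}|\delta|)\,dv}{\int_{|v-u_2|<\delta} \exp(K^{+}|\delta|)\,dv}
  = E\, \frac{\exp(-K^{-}|\delta|)}{\exp(K^{+}|\delta|)}.$$

Taking the limit $\delta \to 0$ gives the desired result. □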


Remark 3.11 The second statement in Theorem 3.10 may appear a little abstract. It is,
however, essentially a complicated way of restating the first statement. To see this fix $u_2$
and note that the right hand side of (3.22) is maximized at a point $u_1$ which minimizes $I(\cdot;y)$.
Thus, independently of the choice of any fixed $u_2$, the identity (3.22) shows that the probability
of a small ball of radius $\delta$ centred at $u_1$ is, approximately, maximized by choosing centres at
minimizers of $I(\cdot;y)$. Why, then, do we bother with the second statement? We do so because it
makes no reference to Lebesgue density. As such it can be generalized to infinite dimensions,
as is required in continuous time for example. We include the second statement for precisely
this reason. We also remark that our assumption on continuous differentiability of $\Psi$ and $h$
is stronger than what is needed, but makes for the rather explicit bounds used in the preceding
proof and is hence pedagogically desirable.

The preceding theorem leads to a natural algorithm: compute

$$v^* = {\rm argmin}_{v \in \mathbb{R}^{|\mathbb{J}_0| \times n}}\, I(v;y).$$

In applications to meteorology this algorithm is known as weak constraint 4DVAR, and we
denote this as w4DVAR in what follows. The word “weak” in this context is used to indicate
that the deterministic dynamics model (2.3) is not imposed as a strong constraint. Instead
the objective functional $I(\cdot;y)$ is minimized; this penalizes deviations from exact satisfaction
of the deterministic dynamics model, as well as deviations from the data.

The w4DVAR method generalizes the standard 4DVAR method which may be derived
from w4DVAR in the limit $\Sigma \to 0$ so that the prior on the model dynamics (2.1) is
deterministic, but with a random initial condition, as in (2.3). In this case the appropriate
minimization is of $I_{\rm det}(v_0;y)$ given by (2.29). This has the advantage of being a lower
dimensional minimization problem than w4DVAR; however it is often a harder minimization
problem, especially when the dynamics is chaotic. The basic 4DVAR algorithm is sometimes
called strong constraint 4DVAR to denote the fact that the dynamics model (2.3) is
imposed as a strong constraint on the minimization of the model-data misfit with respect to
the initial condition; we simply refer to the method as 4DVAR. The following theorem may
be proved similarly to Theorem 3.10.


Theorem 3.12 Consider the data assimilation problem for deterministic dynamics: (2.3), (2.2)
with $\Psi \in C^1(\mathbb{R}^n, \mathbb{R}^n)$ and $h \in C^1(\mathbb{R}^n, \mathbb{R}^m)$. Then:

• (i) the infimum of $I_{\rm det}(\cdot;y)$ given in (2.29) is attained at at least one point $v_0^*$ in $\mathbb{R}^n$. It
follows that the density $\varrho(v_0) = \mathbb{P}(v_0|y)$ on $\mathbb{R}^n$ associated with the posterior probability
$\nu$ given by Theorem 2.11 is maximized at $v_0^*$;

• (ii) furthermore, if $B(z,\delta)$ denotes a ball in $\mathbb{R}^n$ of radius $\delta$, centred at $z$, then

$$\lim_{\delta \to 0} \frac{\mathbb{P}^{\nu}(B(z_1,\delta))}{\mathbb{P}^{\nu}(B(z_2,\delta))}
  = \exp\bigl(I_{\rm det}(z_2;y) - I_{\rm det}(z_1;y)\bigr).$$


As in the case of stochastic dynamics we do not discuss optimization methods to perform
minimization associated with variational problems; this is because optimization is a
well-established and mature research area which is hard to do justice to within the confines of this
book. However we conclude this section with an example which illustrates certain advantages
of the Bayesian perspective over the optimization or variational perspective. Recall from
Theorem 2.15 that the Bayesian posterior distribution is continuous with respect to small
changes in the data. In contrast, computation of the global maximizer of the probability may
be discontinuous as a function of data. To illustrate this consider the probability measure $\mu^{\epsilon}$
on $\mathbb{R}$ with Lebesgue density proportional to $\exp(-V^{\epsilon}(u))$ where

$$V^{\epsilon}(u) = \tfrac14 (1 - u^2)^2 + \epsilon u. \tag{3.23}$$

It is a straightforward application of the methodology behind the proof of Theorem 2.15 to
show that $\mu^{\epsilon}$ is Lipschitz continuous in $\epsilon$, with respect to the Hellinger metric. Furthermore
the methodology behind Theorems 3.10 and 3.12 shows that the probability with respect to
this measure is maximized where $V^{\epsilon}$ is minimized. The global minimum, however, changes
discontinuously, even though the posterior distribution changes smoothly. This is illustrated
in Figure 3.1, where the left hand panel shows the continuous evolution of the probability
density function, whilst the right hand panel shows the discontinuity in the global maximizer
of the probability (minimizer of $V^{\epsilon}$) as $\epsilon$ passes through zero. The explanation for this
difference between the fully Bayesian approach and MAP estimation is as follows. The measure
$\mu^{\epsilon}$ has two peaks, for small $\epsilon$, close to $\pm 1$. The Bayesian approach accounts for both of these
peaks simultaneously and weights their contribution to expectations. In contrast the MAP
estimation approach leads to a global minimum located near $u = -1$ for $\epsilon > 0$ and near
$u = +1$ for $\epsilon < 0$, resulting in a discontinuity.



(a) $\mu^{\epsilon}$ for $\epsilon > 0$, $\epsilon < 0$, and $\epsilon = 0$. (b) Global minima of $V^{\epsilon}$ as a function of $\epsilon$.


Figure 3.1: Plot of (3.23) shows discontinuity of the global maximum as a function of $\epsilon$.
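The contrast between the MAP estimator and a posterior expectation in this example can be checked numerically. The sketch below is a simple illustration: it evaluates the double-well potential $V^{\epsilon}(u) = \tfrac14(1-u^2)^2 + \epsilon u$ on a uniform grid over $[-3, 3]$ (the grid and its resolution are assumptions made here), locates the global minimizer by brute force, and approximates the posterior mean by a Riemann sum.

```python
import math


def V(u, eps):
    """Double-well potential V^eps(u) = (1 - u^2)^2 / 4 + eps * u."""
    return 0.25 * (1.0 - u * u) ** 2 + eps * u


# Uniform grid on [-3, 3]; the density exp(-V) is negligible outside it.
GRID = [-3.0 + 6.0 * i / 6000 for i in range(6001)]


def map_estimate(eps):
    """Global minimizer of V^eps over the grid, i.e. the MAP point."""
    return min(GRID, key=lambda u: V(u, eps))


def posterior_mean(eps):
    """Mean of the density proportional to exp(-V^eps), by a Riemann sum."""
    weights = [math.exp(-V(u, eps)) for u in GRID]
    return sum(u * w for u, w in zip(GRID, weights)) / sum(weights)


for eps in (-0.1, -0.01, 0.01, 0.1):
    print(f"eps = {eps:+.2f}: MAP = {map_estimate(eps):+.3f}, "
          f"mean = {posterior_mean(eps):+.3f}")
```

As $\epsilon$ changes sign the printed MAP point switches between the two wells, while the mean moves only slightly, which is the behaviour described in the text.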


3.4 Illustrations 

We describe a range of numerical experiments which illustrate the application of MCMC
methods and variational methods to the smoothing problems which arise in both deterministic
and stochastic dynamics.

The first illustration concerns use of the RWM algorithm to study the smoothing distribution
for Example 2.4 in the case of deterministic dynamics where our aim is to find $\mathbb{P}(v_0|y)$.
Recall Figure 2.13a, which shows the true posterior pdf, found by plotting the formula given
in Theorem 2.8. We now approximate the true posterior pdf by the MCMC method, using
the same parameters, namely $m_0 = 0.5$, $C_0 = 0.01$, $\gamma = 0.2$ and $v_0^{\dagger} = 0.3$. In Figure 3.2 we
compare the posterior pdf calculated by the RWM method (represented by the histogram
of the output of the Markov chain) with the true posterior pdf $\varrho$. The two distributions are
almost indistinguishable when plotted together in Figure 3.2a; in Figure 3.2b we plot their
difference, which as we can see is small, relative to the true value. We deduce that the number
of samples used, N = 10®, results here in accurate sampling of the posterior.

Figure 3.2: Comparison of the posterior for Example 2.4 for $r = 4$ using random walk
Metropolis and equation (2.29) directly as in the MATLAB program p2.m. We have used
$J = 5$, $C_0 = 0.01$, $m_0 = 0.5$, $\gamma = 0.2$ and true initial condition $v_0 = 0.3$, see also p3.m in
section 5.2.1. We have used N = 10® samples from the MCMC algorithm.

We now turn to the use of MCMC methods to sample the smoothing pdf $\mathbb{P}(v|y)$ in the
case of stochastic dynamics (2.1), using the Independence Dynamics Sampler and both pCN
methods. Before describing application of numerical methods we study the ergodicity of the
Independence Dynamics Sampler in a simple, but illustrative, setting. For simplicity assume
that the observation operator $h$ is bounded so that, for all $u \in \mathbb{R}^n$, $|h(u)| \le h_{\max}$. Then,
recalling the notation $Y_j = \{y_\ell\}_{\ell=1}^{j}$ from the filtering problem, we have

$$\Phi(u;y) \le \sum_{j=0}^{J-1} \bigl|\Gamma^{-1/2}\bigr|^2 \bigl(|y_{j+1}|^2 + |h(u_{j+1})|^2\bigr)
  \le \bigl|\Gamma^{-1/2}\bigr|^2 \Bigl(\sum_{j=0}^{J-1} |y_{j+1}|^2 + J h_{\max}^2\Bigr)
  \le \bigl|\Gamma^{-1/2}\bigr|^2 \bigl(|Y_J|^2 + J h_{\max}^2\bigr) =: \Phi_{\max}.$$

Since $\Phi \ge 0$ this shows that every proposed step is accepted with probability exceeding
$e^{-\Phi_{\max}}$ and hence that, since proposals are made with the prior measure $\mu_0$ describing the
unobserved stochastic dynamics,

$$p(u, A) \ge e^{-\Phi_{\max}}\,\mu_0(A).$$

Thus Theorem 3.3 applies and, in particular, (3.10) and (3.11) hold, with $\varepsilon = e^{-\Phi_{\max}}$, under
these assumptions. This positive result about the ergodicity of the MCMC method also
indicates the potential difficulties with the Independence Dynamics Sampler. The Independence
Sampler relies on draws from the prior matching the data well. Where the data set is large
($J \gg 1$) or the noise covariance small ($|\Gamma| \ll 1$) this will happen infrequently, because $\Phi_{\max}$
will be large, and the MCMC method will reject frequently and be inefficient. To illustrate
this we consider application of the method to Example 2.3, using the same parameters
as in Figure 2.3; specifically we take $\alpha = 2.5$ and $\Sigma = \sigma^2 = 1$. We now sample the posterior
distribution and then plot the resulting accept-reject ratio $a$ for the Independence Dynamics
Sampler, employing different values of noise $\Gamma$ and different sizes of the data set $J$. This is
illustrated in Figure 3.3.
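To see how quickly the lower bound $e^{-\Phi_{\max}}$ degrades, the following sketch evaluates it in the scalar case, where $|\Gamma^{-1/2}|^2 = 1/\gamma^2$. The data values and the bound $h_{\max} = 1$ are hypothetical choices made purely for illustration, and `acceptance_lower_bound` is a name introduced here.

```python
import math


def acceptance_lower_bound(y, gamma, h_max):
    """Return exp(-Phi_max) with Phi_max = (|Y_J|^2 + J * h_max^2) / gamma^2."""
    J = len(y)
    phi_max = (sum(yj * yj for yj in y) + J * h_max ** 2) / gamma ** 2
    return math.exp(-phi_max)


for J in (1, 10, 100):
    for gamma in (1.0, 0.1):
        y = [0.5] * J          # hypothetical data values, for illustration only
        bound = acceptance_lower_bound(y, gamma, 1.0)
        print(f"J = {J:3d}, gamma = {gamma}: lower bound = {bound:.3e}")
```

The bound decays exponentially in both $J$ and $1/\gamma^2$, which is consistent with the frequent rejections described above for large data sets or small observational noise.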

Figure 3.3: Accept-reject probability of the Independence Sampler for Example 2.3 for
$\alpha = 2.5$, $\Sigma = \sigma^2 = 1$ and $\Gamma = \gamma^2$ for different values of $\gamma$ and $J$.

In addition, in Figure 3.4 we plot the output, and the running average of the output,
projected into the first element of the state vector, the initial condition (remember that we
are defining a Markov chain on $\mathbb{R}^{J+1}$), for K = 10® steps. Figure 3.4a clearly exhibits the
fact that there are many rejections caused by the low average acceptance probability. Figure
3.4b shows that the running average has not converged after 10® steps, indicating that the
chain needs to be run for longer. If we run the Markov chain over N = 10® steps then we
do get convergence. This is illustrated in Figure 3.5. In Figure 3.5a we see that the running
average has converged to its limiting value when this many steps are used. In Figure 3.5b
we plot the marginal probability distribution for the first element of the state vector, calculated
from this converged Markov chain.
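The running average used as a diagnostic above is simply the sequence of partial means of the chain output. A minimal sketch follows; the helper `has_settled` is a crude heuristic introduced here for illustration (it compares the final running average with the value halfway through the run) and is not a rigorous convergence test.

```python
def running_average(xs):
    """Return the partial means (x_1 + ... + x_k) / k for k = 1, ..., len(xs)."""
    out, total = [], 0.0
    for k, x in enumerate(xs, start=1):
        total += x
        out.append(total / k)
    return out


def has_settled(avg, tol):
    """Heuristic check: the running average changed by less than tol in the second half."""
    return abs(avg[-1] - avg[len(avg) // 2]) < tol


# Hypothetical chain output with long runs of repeated values (many rejections).
chain = [0.0] * 50 + [1.0] * 50 + [0.0] * 100
avg = running_average(chain)
print(avg[49], avg[99], avg[-1], has_settled(avg, 0.05))
```

Plotting `avg` against the step number gives the kind of running-average trace shown in the figures; long flat stretches in the raw chain correspond to runs of rejected proposals.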

In order to get faster convergence when sampling the posterior distribution we turn to
application of the pCN method. Unlike the Independence Dynamics Sampler, this contains a
tunable parameter which can vary the size of the proposals. In particular, the possibility of
making small moves, with resultant higher acceptance probability, makes this a more flexible
method than the Independence Dynamics Sampler. In Figure 3.6 we show application of the
pCN sampler, again considering Example 2.3 for $\alpha = 2.5$, $\Sigma = \sigma^2 = 1$ and $\Gamma = \gamma^2 = 1$, with
$J = 10$, the same parameters used in Figure 3.4.

In the case that the dynamics are significantly influencing the trajectory, i.e. the regime
of large $\Psi$ or small $\sigma$, it may be the case that the standard pCN method is not effective,
due to large effects of the $F$ term, and the improbability of Gaussian samples being close
to samples of the prior on the dynamics. The pCN Dynamics Sampler, recall, acts on the
space comprising the initial condition and forcing, both of which are Gaussian under the
prior, and so may sometimes have an advantage given that pCN-type methods are based on
Gaussian proposals. The use of this method is explored in Figure 3.7 for Example 2.3 for
$\alpha = 2.5$, $\Sigma = \sigma^2 = 1$ and $\Gamma = \gamma^2 = 1$, with $J = 10$.

We now turn to variational methods; recall Theorems 3.10 and 3.12 in the stochastic
and deterministic cases respectively. In Figure 3.8a we plot the MAP (4DVAR) estimator
for our Example 2.1, choosing exactly the same parameters and data as for Figure 2.10a, in
the case where J = 10^. In this case the function $I_{\rm det}(\cdot;y)$ is quadratic and has a unique
global minimum. A straightforward minimization routine will easily find this: we employed
standard MATLAB optimization software initialized at three different points. From all three
starting points chosen the algorithm finds the correct global minimizer.

Figure 3.4: Output and running average of the Independence Dynamics Sampler after K =
10® steps, for Example 2.3 for $\alpha = 2.5$, $\Sigma = \sigma^2 = 1$ and $\Gamma = \gamma^2 = 1$, with $J = 10$, see also
p4.m in section 5.2.2.

In Figure 3.8b we plot the MAP (4DVAR) estimator for our Example 2.4 for the case
$r = 4$, choosing exactly the same parameters and data as for Figure 2.13. We again employ a
MATLAB optimization routine, and we again initialize it at three different points. The value
obtained for our MAP estimator depends crucially on the choice of initial condition in our
minimization procedure. In particular, for the three initializations shown, only when we start
from 0.2 are we able to find the global minimum of $I_{\rm det}(v_0;y)$. By Theorem 3.12 this global
minimum corresponds to the maximum of the posterior distribution, and we see that finding the
MAP estimator is a difficult task for this problem. Starting with the other two initial conditions
displayed we converge to one of the many local minima of $I_{\rm det}(v_0;y)$; these local minima are in
fact regions of very low probability, as we can see in Figure 2.13a. This illustrates the care
required when computing 4DVAR solutions in cases where the forward problem exhibits
sensitivity to initial conditions.
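The dependence of a local minimizer on its starting point can be explored with a small multi-start experiment. The sketch below is not the MATLAB routine used for the figures: it assumes the logistic-map model $v_{j+1} = r v_j(1 - v_j)$ with direct observations, generates its own synthetic data, and uses a simple derivative-free descent (`local_descent`, a hypothetical helper). Only the starting point 0.2 is taken from the text; the other starting points are illustrative.

```python
import random


def orbit(v0, r, J):
    """Return [v_1, ..., v_J] for the assumed map v_{j+1} = r * v_j * (1 - v_j)."""
    path, v = [], v0
    for _ in range(J):
        v = r * v * (1.0 - v)
        path.append(v)
    return path


def I_det(v0, y, m0, C0, gamma, r):
    """4DVAR objective: background penalty plus model-data misfit."""
    background = 0.5 * (v0 - m0) ** 2 / C0
    misfit = sum(0.5 * ((yj - vj) / gamma) ** 2
                 for yj, vj in zip(y, orbit(v0, r, len(y))))
    return background + misfit


def local_descent(f, x, h=0.01, h_min=1e-9, max_iter=100000):
    """Derivative-free 1D descent: move to a lower neighbour, else halve the step."""
    fx = f(x)
    for _ in range(max_iter):
        if h < h_min:
            break
        for cand in (x + h, x - h):
            fc = f(cand)
            if fc < fx:
                x, fx = cand, fc
                break
        else:
            h *= 0.5        # neither neighbour improves: refine the step
    return x, fx


rng = random.Random(3)
r, J, m0, C0, gamma = 4.0, 5, 0.5, 0.01, 0.2
y = [vj + gamma * rng.gauss(0.0, 1.0) for vj in orbit(0.3, r, J)]


def objective(v0):
    return I_det(v0, y, m0, C0, gamma, r)


starts = (0.05, 0.2, 0.4)
results = {x0: local_descent(objective, x0) for x0 in starts}

# Brute-force reference: dense grid on [0, 1], refined by the same local search.
grid_best = min((i / 20000 for i in range(20001)), key=objective)
reference = local_descent(objective, grid_best)

for x0, (x, fx) in results.items():
    print(f"start {x0}: local minimizer {x:.5f}, I_det = {fx:.3f}")
print(f"grid reference: {reference[0]:.5f}, I_det = {reference[1]:.3f}")
```

Comparing each local result with the grid reference shows which starting points reached the lowest value found; a brute-force grid of this kind is only feasible because the unknown here is one-dimensional.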

Figure 3.9 shows application of the w4DVAR method, or MAP estimator given by Theorem
3.10, in the case of Example 2.3 with parameters set at $J = 5$, $\gamma = \sigma = 0.1$. In contrast
to the previous example, this is no longer a one-dimensional minimization problem: we are
minimizing $I(v;y)$ given by (2.21) over $v \in \mathbb{R}^6$, given the data $y \in \mathbb{R}^5$. The figure shows
that there are at least 2 local minimizers for this problem, with $v^{(1)}$ closer to the truth than
$v^{(2)}$, and with $I(v^{(1)};y)$ considerably smaller than $I(v^{(2)};y)$. However $v^{(2)}$ has a larger basin
of attraction for the optimization software used: many initial conditions lead to $v^{(2)}$, while
fewer lead to $v^{(1)}$. Furthermore, whilst we believe that $v^{(1)}$ is the global minimizer, it is
difficult to state this with certainty, even for this relatively low-dimensional model. To get
greater certainty an exhaustive and expensive search of the six dimensional parameter space
would be needed.


3.5 Bibliographic Notes 

• The Kalman Smoother from subsection 3.1 leads to a system of linear equations,
characterized in Theorem 3.1. These equations are of block tridiagonal form, and may be
solved by LU factorization. The Kalman filter corresponds to the LU sweep in this
factorization, a fact that was highlighted in [23].

Figure 3.5: Running average and probability density of the first element of the state vector for the
Independence Dynamics Sampler after K = 10® steps, for Example 2.3 for $\alpha = 2.5$, $\Sigma = \sigma^2 = 1$
and $\Gamma = \gamma^2$, with $\gamma = 1$ and $J = 10$, see also p4.m in section 5.2.2.


• Section 3.2. Monte Carlo Markov Chain methods have a long history, initiated in the
1953 paper [86] and then generalized to an abstract formulation in the 1970 paper [53].
The subject is overviewed from an algorithmic point of view in m. Theorem 3.3 is
contained in m, and that reference also contains many other convergence theorems for
Markov chains; in particular we note that it is often possible to increase substantially
the class of functions $\varphi$ to which the theorem applies by means of Lyapunov function
techniques, which control the tails of the probability distribution. The specific form of
the pCN-MCMC method which we introduce here has been chosen to be particularly
effective in high dimensions; see [31] for an overview, [16] for the introduction of pCN
and other methods for sampling probability measures in infinite dimensions, in the
context of conditioned diffusions, and m for the application to a data assimilation
problem.

The key point about pCN methods is that the proposal is reversible with respect to an
underlying Gaussian measure. Even in the absence of data, if $\Psi \neq 0$ then this Gaussian
measure is far from the measure governing the actual dynamics. In contrast, still in
the absence of data, this Gaussian measure is precisely the measure governing the noise
and initial condition, giving the pCN Dynamics Sampler a natural advantage over the
standard pCN method. In particular, notice that the acceptance probability is now
determined only by the model-data misfit for the pCN Dynamics Sampler, and does
not have to account for incorporation of the dynamics as it does in the original pCN
method; this typically improves the acceptance rate of the pCN Dynamics Sampler over
the standard pCN method. Therefore, this method may be preferable, particularly in
the case of unstable dynamics. The pCN Dynamics Sampler was introduced in [30] and
further trialled in [55]; it shows considerable promise.

The subject of MCMC methods is an enormous one to which we cannot do justice in
this brief presentation. There are two relevant time-scales for the Markov chain: the
burn-in time which determines the time to reach part of state-space where most of the
probability mass is concentrated, and the mixing time which determines the time taken
to fully explore the probability distribution. Our brief overview would not be complete
without a cursory discussion of convergence diagnostics [H] which attempt to ensure
that the Markov chain is run long enough to have both burnt-in and mixed. Whilst
none of the diagnostics are foolproof, there are many simple tests that can and should
be undertaken. The first is to simply study (as we have done in this section) trace plots
of quantities of interest (components of the solution, acceptance probabilities) and the
running average of these quantities of interest. More sophisticated diagnostics are also
available. For example, comparison of the within-chain and between-chain variances
of multiple chains beginning from over-dispersed initial conditions is advocated in the
works [45], [22]. The authors of those works advise applying a range of tests based on
comparing inferences from individual chains and a mixture of chains. These and other
more sophisticated diagnostics are not considered further here, and the reader is referred
to the cited works for further details and discussion.

Figure 3.6: Trace-plot and running average of the first element of the state vector for the pCN sampler
after K = 10® steps, for Example 2.3 with $\alpha = 2.5$, $\Sigma = \sigma^2 = 1$ and $\Gamma = \gamma^2 = 1$, with $J = 10$,
see also p5.m in section 5.2.3.

• Section 3.3. Variational methods, known as 4DVAR in the meteorology community 
and widely used in practice, have the distinction of being well-founded statistically 
when compared with the ad hoc non-Gaussian filters described in the next chapter, 
which are also widely used in practice in their EnKF and 3DVAR formulations: they 
correspond to the maximum a posteriori estimator (MAP estimator) for the fully Bayesian 
posterior distribution on model state given data [64]. See [na and the references 
therein for a discussion of the applied context; see [35] for a more theoretical 
presentation, including connections to the Onsager-Machlup functional arising in the theory 
of diffusion processes. The European Centre for Medium-Range Weather Forecasts 
(ECMWF) runs a weather prediction code based on spectral approximation of continuum 
versions of Newton's balance laws, together with various sub-grid scale models. 
Initialization of this prediction code is based on the use of 4DVAR-like methods. The 
conjunction of this computational forward model, together with the use of 4DVAR to 
incorporate data, results in what is the best weather predictor, worldwide, according 
to a widely adopted metric by which the prediction skill of forecasts is measured. The 
subject of algorithms for optimization, which of course underpins variational methods, 
is vast and we have not attempted to cover it here; we mention briefly that many 
methods use first derivative information (for example steepest descent methods) and 
second derivative information (Newton methods); the reader is directed to [9T] for 
details. Derivatives can also be useful in making MCMC proposals, leading to the Langevin 
and the hybrid Monte Carlo methods, for example; see [51] and the references therein. 

Figure 3.7: Trace-plot and running average of the first element of the chain for the pCN dynamics 
sampler after K = 10® steps, for Example 2.3 with $\alpha = 2.5$, $\Sigma = \sigma^2 = 1$ and $\Gamma = \gamma^2 = 1$, 
with J = 10; see also p6.m in section 5.2.3. 
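
The simple convergence diagnostics mentioned in the notes on MCMC above (running averages, and comparison of within-chain and between-chain variances for chains started from over-dispersed initial conditions) can be sketched as follows. This is only a minimal illustration in Python, not one of the programs accompanying the text: the function names and the toy chain are hypothetical, and the variance comparison is written in one commonly used form which may differ in detail from the tests in the cited works.

```python
import random
from statistics import mean, variance


def running_average(chain):
    """Running average (1/k) * sum of the first k entries, for k = 1, 2, ..."""
    out, total = [], 0.0
    for k, x in enumerate(chain, start=1):
        total += x
        out.append(total / k)
    return out


def variance_ratio(chains):
    """Compare between-chain and within-chain variances of equal-length chains.

    Values close to 1 suggest the chains agree; values well above 1 suggest
    that the chains have not yet burnt in or mixed.
    """
    k = len(chains[0])                          # length of each chain
    chain_means = [mean(c) for c in chains]
    within = mean(variance(c) for c in chains)  # average within-chain variance
    between = k * variance(chain_means)         # scaled variance of chain means
    pooled = (k - 1) / k * within + between / k
    return (pooled / within) ** 0.5


def toy_chain(start, steps, rng, rho=0.9):
    """Hypothetical stand-in for MCMC output: an AR(1) chain targeting N(0, 1)."""
    x, out = start, []
    for _ in range(steps):
        x = rho * x + (1.0 - rho * rho) ** 0.5 * rng.gauss(0.0, 1.0)
        out.append(x)
    return out


rng = random.Random(0)
starts = [-10.0, 0.0, 10.0]                     # over-dispersed initial conditions
chains = [toy_chain(s, 5000, rng) for s in starts]
ratio_all = variance_ratio(chains)              # long chains
ratio_early = variance_ratio([c[:20] for c in chains])  # burn-in only
```

Using only the first few steps of each chain gives a ratio well above one, because the chains still remember their different starting points; for the full chains the ratio is close to one. As the notes stress, passing such a test does not prove convergence.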


3.6 Exercises 

1. Consider the posterior distribution on the initial condition, given by Theorem 2.11, in 
the case of deterministic dynamics. In the case of Example 2.4, program p2.m plots the 
prior and posterior distributions for this problem for data generated with true initial 
condition $v_0 = 0.1$. Why is the posterior distribution concentrating much closer to 0.9 
than to the true initial condition at 0.1? Change the mean of the prior from 0.7 to 0.3; 
what do you observe regarding the effect on the posterior? Explain what you observe. 
Illustrate your findings with graphics. 

2. Consider the posterior distribution on the initial condition, given by Theorem 2.11, 
in the case of deterministic dynamics. In the case of Example 2.4, program p3.m 
approximates the posterior distribution for this problem for data generated with true 
initial condition $v_0 = 0.3$. Why is the posterior distribution in this case approximately 
symmetric about 0.5? What happens if the mean of the prior is changed from 0.5 to 
0.1? Explain what you observe. Illustrate your findings with graphics. 

3. Consider the posterior distribution on the initial condition, given by Theorem 2.11, 
in the case of deterministic dynamics. In the case of Example 2.4, program p3.m 
approximates the posterior distribution for this problem. Modify the program so that 
the prior and data are the same as for the first exercise in this section. Compare 
the approximation to the posterior obtained by use of program p3.m with the true 
posterior as computed by program p2.m. Carry out similar comparisons for different 
choices of prior, ensuring that programs p2.m and p3.m share the same prior and the 
same data. In all cases experiment with the choice of the parameter $\beta$ in the proposal 
distribution within p3.m, and determine its effect on the displayed approximation of 
the true posterior computed from p2.m. Illustrate your findings with graphics. 

Figure 3.8: Finding local minima of $I(v_0; y)$ for Examples 2.1 and 2.4. The values and 
the data used are the same as for Figures 2.10b and 2.13b. $(\circ, *, \square)$ denote three different 
initial conditions for starting the minimization process: (−8, −2, 8) for Example 2.1 and 
(0.05, 0.2, 0.4) for Example 2.4. 


4. Consider the posterior distribution on the initial condition, given by Theorem 2.11, 
in the case of deterministic dynamics. In the case of Example 2.4, program p3.m 
approximates the posterior distribution for this problem. Modify the program so that 
it applies to Example 2.3. Experiment with the choice of the parameter J, which 
determines the length of the Markov chain simulation, within p3.m. Illustrate your 
findings with graphics. 

5. Consider the posterior distribution on the signal, given by Theorem 2.8, in the case 
of stochastic dynamics. In the case of Example 2.3, program p4.m approximates 
the posterior distribution for this problem, using the Independence Dynamics Sampler. 
Run this program for a range of values of $\gamma$. Report and explain the effect of $\gamma$ on the 
acceptance probability curves. 

6. Consider the posterior distribution on the signal, given by Theorem 2.8, in the case 
of stochastic dynamics. In the case of Example 2.3, program p5.m approximates the 
posterior distribution for this problem, using the pCN sampler. Run this program for 
a range of values of $\gamma$. Report and explain the effect of $\beta$ on the acceptance probability 
curves. 

7. Consider the posterior distribution on the signal, given by Theorem 2.8, in the case 
of stochastic dynamics. In the case of Example 2.3, program p6.m approximates the 
posterior distribution for this problem, using the pCN dynamics sampler. Run this 
program for a range of values of $\gamma$. Report and explain the effect of a and of J on the 
acceptance probability curves. 

8. Consider the MAP estimator for the posterior distribution on the signal, given by 
Theorem 3.10, in the case of stochastic dynamics. Program p7.m finds the MAP estimator 
for Example 2.3. Increase J to 50 and display your results graphically. Now repeat your 
experiments for the values $\gamma = 0.01$, $0.1$ and $10$ and display and discuss your findings. 
Repeat the experiments using the “truth” as the initial condition for the minimization. 
What effect does this have? Explain this effect. 

Figure 3.9: Weak constraint 4DVAR for J = 5, $\gamma = \sigma = 0.1$, illustrating two local minimizers; 
see also p7.m in section 5.2.5. 


9. Prove Theorem 3.12. 

10. Consider the RWM proposal (3.15), applied in the case of stochastic 
dynamics. Find the form of the Metropolis-Hastings acceptance probability in this 
case. 

11. Consider the family of probability measures on M with Lebesgue density proportional 
to exp(—R'^(u)) with V'^(u) given by (I3.23p . Prove that the family of measure is 
locally Lipschitz in the Hellinger metric and in the total variation metric. 

Chapter 4 


Discrete Time: Filtering Algorithms 


In this chapter we describe various algorithms for the filtering problem. Recall from section 
2.4 that filtering refers to the sequential update of the probability distribution on the state 
given the data, as data is acquired, and that $Y_j = \{y_\ell\}_{\ell=1}^{j}$ denotes the data accumulated 
up to time j. The filtering update from time j to time j + 1 may be broken into two steps: 
prediction, which is based on the equation for the state evolution, using the Markov kernel for 
the stochastic or deterministic dynamical system, and which maps $\mathbb{P}(v_j|Y_j)$ into $\mathbb{P}(v_{j+1}|Y_j)$; and 
analysis, which incorporates data via Bayes' formula and maps $\mathbb{P}(v_{j+1}|Y_j)$ into $\mathbb{P}(v_{j+1}|Y_{j+1})$. 
All but one of the algorithms we study (the optimal proposal version of the particle filter) 
will also reflect these two steps. 

We start in section 4.1 with the Kalman filter, which provides an exact algorithm to 
determine the filtering distribution for linear problems with additive Gaussian noise. Since 
the filtering distribution is Gaussian in this case, the algorithm comprises an iteration which 
maps the mean and covariance from time j to time j + 1. In section 4.2 we show how the 
idea of Kalman filtering may be used to combine a dynamical model with data for nonlinear 
problems; in this case the posterior distribution is not Gaussian, but the algorithms proceed 
by invoking a Gaussian ansatz in the analysis step of the filter. This results in algorithms 
which do not provably approximate the true filtering distribution in general; in various forms 
they are, however, robust to use in high dimension. In section 4.3 we introduce the particle 
filter methodology, which leads to provably accurate estimates of the true filtering distribution 
but which is, in its current forms, poorly behaved in high dimensions. The algorithms in 
sections 4.1–4.3 are concerned primarily with stochastic dynamics, but setting $\Sigma = 0$ yields 
the corresponding algorithms for deterministic dynamics. In section 4.4 we study the long 
time behaviour of some of the filtering algorithms introduced in the previous sections. Finally, 
in section 4.5 we present some numerical illustrations and conclude with bibliographic notes 
and exercises in sections 4.6 and 4.7. 

For clarity of exposition we again recall the form of the data assimilation problem. The 
signal is governed by the model of equations (2.1): 

$$v_{j+1} = \Psi(v_j) + \xi_j, \quad j \in \mathbb{Z}^+,$$
$$v_0 \sim N(m_0, C_0),$$

where $\xi = \{\xi_j\}_{j \in \mathbb{Z}^+}$ is an i.i.d. sequence, independent of $v_0$, with $\xi_j \sim N(0, \Sigma)$. The data is 
given by equation (2.2): 

$$y_{j+1} = h(v_{j+1}) + \eta_{j+1}, \quad j \in \mathbb{Z}^+,$$

where $h : \mathbb{R}^n \to \mathbb{R}^m$ and $\eta = \{\eta_j\}_{j \in \mathbb{Z}^+}$ is an i.i.d. sequence, independent of $(v_0, \xi)$, with 
$\eta_j \sim N(0, \Gamma)$. 


4.1 Linear Gaussian Problems: The Kalman Filter 


This algorithm provides a sequential method for updating the filtering distribution $\mathbb{P}(v_j|Y_j)$ 
from time j to time j + 1, when $\Psi$ and h are linear maps. In this case the filtering distribution 
is Gaussian and it can be characterized entirely through its mean and covariance. To see this 
we note that the prediction step preserves Gaussianity by Lemma 1.5; the analysis step 
preserves Gaussianity because it is an application of Bayes' formula (2.31), and then Lemma 
1.6 establishes the required Gaussian property since the log pdf is quadratic in the unknown. 
To be concrete we let 

$$\Psi(v) = Mv, \quad h(v) = Hv \tag{4.2}$$

for matrices $M \in \mathbb{R}^{n \times n}$, $H \in \mathbb{R}^{m \times n}$. We assume that $m < n$ and $\mathrm{Rank}(H) = m$. We let 
$(m_j, C_j)$ denote the mean and covariance of $v_j|Y_j$, noting that this entirely characterizes the 
random variable since it is Gaussian. We let $(\hat{m}_{j+1}, \hat{C}_{j+1})$ denote the mean and covariance 
of $v_{j+1}|Y_j$, noting that this too completely characterizes the random variable, since it is also 
Gaussian. We now derive the map $(m_j, C_j) \mapsto (m_{j+1}, C_{j+1})$, using the intermediate variables 
$(\hat{m}_{j+1}, \hat{C}_{j+1})$ so that we may compute the prediction and analysis steps separately. This gives 
the Kalman filter in a form where the update is expressed in terms of precision rather than 
covariance. 

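Theorem 4.1 Assume that $C_0, \Gamma, \Sigma > 0$. Then $C_j > 0$ for all $j \in \mathbb{Z}^+$ and 

$$C_{j+1}^{-1} = (M C_j M^T + \Sigma)^{-1} + H^T \Gamma^{-1} H, \tag{4.3a}$$
$$C_{j+1}^{-1} m_{j+1} = (M C_j M^T + \Sigma)^{-1} M m_j + H^T \Gamma^{-1} y_{j+1}. \tag{4.3b}$$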

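Proof We assume for the purposes of induction that $C_j > 0$, noting that this is true for j = 0 
by assumption. The prediction step is determined by (2.1) in the case $\Psi(\cdot) = M\cdot$: 

$$v_{j+1} = M v_j + \xi_j, \quad \xi_j \sim N(0, \Sigma).$$

From this it is clear that 

$$\mathbb{E}(v_{j+1}|Y_j) = \mathbb{E}(M v_j|Y_j) + \mathbb{E}(\xi_j|Y_j).$$

Since $\xi_j$ is independent of $Y_j$ we have 

$$\hat{m}_{j+1} = M m_j. \tag{4.4}$$

Similarly 

$$\begin{aligned}
\mathbb{E}\bigl((v_{j+1} - \hat{m}_{j+1}) \otimes (v_{j+1} - \hat{m}_{j+1})|Y_j\bigr)
 &= \mathbb{E}\bigl(M(v_j - m_j) \otimes M(v_j - m_j)|Y_j\bigr) + \mathbb{E}(\xi_j \otimes \xi_j|Y_j) \\
 &\quad + \mathbb{E}\bigl(M(v_j - m_j) \otimes \xi_j|Y_j\bigr) + \mathbb{E}\bigl(\xi_j \otimes M(v_j - m_j)|Y_j\bigr).
\end{aligned}$$

Again, since $\xi_j$ is independent of $Y_j$ and of $v_j$, we have 

$$\hat{C}_{j+1} = M\,\mathbb{E}\bigl((v_j - m_j) \otimes (v_j - m_j)|Y_j\bigr) M^T + \Sigma = M C_j M^T + \Sigma. \tag{4.5}$$

Note that $\hat{C}_{j+1} > 0$ because $C_j > 0$ by the inductive hypothesis and $\Sigma > 0$ by assumption. 

Now we consider the analysis step. By (2.31), which is just Bayes' formula, and using 
Gaussianity, we have 

$$\exp\Bigl(-\tfrac12 |v - m_{j+1}|_{C_{j+1}}^2\Bigr) \propto \exp\Bigl(-\tfrac12 \bigl|\Gamma^{-1/2}(y_{j+1} - Hv)\bigr|^2 - \tfrac12 \bigl|\hat{C}_{j+1}^{-1/2}(v - \hat{m}_{j+1})\bigr|^2\Bigr) \tag{4.6a}$$
$$= \exp\Bigl(-\tfrac12 |y_{j+1} - Hv|_{\Gamma}^2 - \tfrac12 |v - \hat{m}_{j+1}|_{\hat{C}_{j+1}}^2\Bigr). \tag{4.6b}$$

Equating quadratic terms in v gives, since $\Gamma > 0$ by assumption, 

$$C_{j+1}^{-1} = \hat{C}_{j+1}^{-1} + H^T \Gamma^{-1} H \tag{4.7}$$

and equating linear terms in v gives¹ 

$$C_{j+1}^{-1} m_{j+1} = \hat{C}_{j+1}^{-1} \hat{m}_{j+1} + H^T \Gamma^{-1} y_{j+1}. \tag{4.8}$$

Substituting the expressions (4.4) and (4.5) for $(\hat{m}_{j+1}, \hat{C}_{j+1})$ gives the desired result. It 
remains to verify that $C_{j+1} > 0$. From (4.7) it follows, since $\Gamma^{-1} > 0$ by assumption and 
$\hat{C}_{j+1} > 0$ (proved above), that $C_{j+1}^{-1} > 0$. Hence $C_{j+1} > 0$ and the induction is complete. □ 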
We may now reformulate the Kalman filter using covariances directly, rather than using 
precisions. 

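Corollary 4.2 Under the assumptions of Theorem 4.1, the formulae for the Kalman filter 
given there may be rewritten as follows: 

$$\begin{aligned}
d_{j+1} &= y_{j+1} - H \hat{m}_{j+1}, \\
S_{j+1} &= H \hat{C}_{j+1} H^T + \Gamma, \\
K_{j+1} &= \hat{C}_{j+1} H^T S_{j+1}^{-1}, \\
m_{j+1} &= \hat{m}_{j+1} + K_{j+1} d_{j+1}, \\
C_{j+1} &= (I - K_{j+1} H) \hat{C}_{j+1},
\end{aligned}$$

with $(\hat{m}_{j+1}, \hat{C}_{j+1})$ given in (4.4), (4.5). 

Proof By (4.7) we have 

$$C_{j+1}^{-1} = \hat{C}_{j+1}^{-1} + H^T \Gamma^{-1} H$$

and application of Lemma 4.4 below gives 

$$\begin{aligned}
C_{j+1} &= \hat{C}_{j+1} - \hat{C}_{j+1} H^T (\Gamma + H \hat{C}_{j+1} H^T)^{-1} H \hat{C}_{j+1} \\
 &= \bigl(I - \hat{C}_{j+1} H^T (\Gamma + H \hat{C}_{j+1} H^T)^{-1} H\bigr) \hat{C}_{j+1} \\
 &= (I - \hat{C}_{j+1} H^T S_{j+1}^{-1} H) \hat{C}_{j+1} \\
 &= (I - K_{j+1} H) \hat{C}_{j+1}
\end{aligned}$$

as required. Then the identity (4.8) gives 

$$\begin{aligned}
m_{j+1} &= C_{j+1} \hat{C}_{j+1}^{-1} \hat{m}_{j+1} + C_{j+1} H^T \Gamma^{-1} y_{j+1} \\
 &= (I - K_{j+1} H) \hat{m}_{j+1} + C_{j+1} H^T \Gamma^{-1} y_{j+1}.
\end{aligned} \tag{4.9}$$

Now note that, again by (4.7), 

$$C_{j+1} \bigl(\hat{C}_{j+1}^{-1} + H^T \Gamma^{-1} H\bigr) = I$$

so that 

$$\begin{aligned}
C_{j+1} H^T \Gamma^{-1} H &= I - C_{j+1} \hat{C}_{j+1}^{-1} \\
 &= I - (I - K_{j+1} H) \\
 &= K_{j+1} H.
\end{aligned}$$

Since H has rank m we deduce that 

$$C_{j+1} H^T \Gamma^{-1} = K_{j+1}.$$

Hence (4.9) gives 

$$m_{j+1} = (I - K_{j+1} H) \hat{m}_{j+1} + K_{j+1} y_{j+1} = \hat{m}_{j+1} + K_{j+1} d_{j+1}$$

as required. □ 

¹ We do not need to match the constant terms (with respect to v) since the normalization constant in 
Bayes' theorem deals with matching these. 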
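
To see that the precision form of Theorem 4.1 and the covariance form of Corollary 4.2 describe the same update, the following sketch carries out one step of each in the scalar case n = m = 1 and compares the results. It is a minimal illustration in Python rather than code from the text; the function names and the numerical values are assumptions chosen only for the check.

```python
def kalman_step_precision(m, c, y, M, H, sigma2, gamma2):
    """One Kalman step in precision form, (4.3), for scalar state and data."""
    c_hat = M * c * M + sigma2                        # predicted covariance, (4.5)
    precision = 1.0 / c_hat + H * H / gamma2          # (4.3a)
    c_new = 1.0 / precision
    m_new = c_new * (M * m / c_hat + H * y / gamma2)  # (4.3b)
    return m_new, c_new


def kalman_step_covariance(m, c, y, M, H, sigma2, gamma2):
    """The same step written in the covariance form of Corollary 4.2."""
    m_hat = M * m                                     # predicted mean, (4.4)
    c_hat = M * c * M + sigma2                        # predicted covariance, (4.5)
    d = y - H * m_hat                                 # innovation
    s = H * c_hat * H + gamma2
    k = c_hat * H / s                                 # Kalman gain
    return m_hat + k * d, (1.0 - k * H) * c_hat


# Illustrative values only; none of them are taken from the text.
args = dict(m=0.5, c=2.0, y=1.3, M=0.9, H=1.0, sigma2=0.1, gamma2=0.25)
precision_form = kalman_step_precision(**args)
covariance_form = kalman_step_covariance(**args)
```

The two returned pairs agree to rounding error. In higher dimensions the first form requires inversion of n × n matrices and the second inversion of an m × m matrix.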

Remark 4.3 The key difference between the Kalman update formulae in Theorem 4.1 and 
in Corollary 4.2 is that, in the former, matrix inversion takes place in the state space, with 
dimension n, whilst in the latter matrix inversion takes place in the data space, with dimension 
m. In many applications $m \ll n$, as the observed subspace dimension is much less than the 
state space dimension, and thus the formulation in Corollary 4.2 is frequently employed in 
practice. The quantity $d_{j+1}$ is referred to as the innovation at time-step j + 1 and measures 
the mismatch of the predicted state from the data. The matrix $K_{j+1}$ is known as the Kalman 
gain. ♠ 

The following matrix identity was used to derive the formulation of the Kalman Filter in 
which inversion takes place in the data space. 

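Lemma 4.4 (Woodbury Matrix Identity) Let $A \in \mathbb{R}^{p \times p}$, $U \in \mathbb{R}^{p \times q}$, $C \in \mathbb{R}^{q \times q}$ and 
$V \in \mathbb{R}^{q \times p}$. If A and C are positive then A + UCV is invertible and 

$$(A + UCV)^{-1} = A^{-1} - A^{-1} U \bigl(C^{-1} + V A^{-1} U\bigr)^{-1} V A^{-1}.$$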
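
As a check on how the lemma enters the proof of Corollary 4.2, the display below spells out the substitution. The identification of A, U, C and V is our reading of that proof and is stated here as an assumption rather than taken from the text.

```latex
% Substitution into Lemma 4.4 (our reading of the proof of Corollary 4.2):
%   A = \hat{C}_{j+1}^{-1}, \quad U = H^T, \quad C = \Gamma^{-1}, \quad V = H,
% so that A^{-1} = \hat{C}_{j+1}, \; C^{-1} = \Gamma, \; V A^{-1} U = H \hat{C}_{j+1} H^T.
\begin{align*}
C_{j+1}
  &= \bigl(\hat{C}_{j+1}^{-1} + H^T \Gamma^{-1} H\bigr)^{-1} \\
  &= \hat{C}_{j+1}
     - \hat{C}_{j+1} H^T \bigl(\Gamma + H \hat{C}_{j+1} H^T\bigr)^{-1} H \hat{C}_{j+1}.
\end{align*}
```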

4.2 Approximate Gaussian Filters 

Here we introduce a family of methods, based on invoking a minimization principle which 
underlies the Kalman filter, and which has a natural generalization to non-Gaussian problems. 
The update equation for the Kalman filter mean derived above can be written as 

$$m_{j+1} = \arg\min_{v} I_{\rm filter}(v)$$

where 

$$I_{\rm filter}(v) := \tfrac12 |y_{j+1} - Hv|_{\Gamma}^2 + \tfrac12 |v - \hat{m}_{j+1}|_{\hat{C}_{j+1}}^2; \tag{4.10}$$

here $\hat{m}_{j+1}$ is calculated from (4.4), and $\hat{C}_{j+1}$ is given by (4.5). The fact that this minimization 
principle holds follows from (4.6). (We note that $I_{\rm filter}(\cdot)$ in fact depends on j, but we suppress 
explicit reference to this dependence for notational simplicity.) 

Whilst the Kalman filter itself is restricted to linear, Gaussian problems, the formulation 
via minimization generalizes to nonlinear problems. A natural generalization of (4.10) to the 
nonlinear case is to define 

$$I_{\rm filter}(v) := \tfrac12 |y_{j+1} - h(v)|_{\Gamma}^2 + \tfrac12 |v - \hat{m}_{j+1}|_{\hat{C}_{j+1}}^2, \tag{4.11}$$

where 

$$\hat{m}_{j+1} = \Psi(m_j) + \xi_j,$$

and then to set 

$$m_{j+1} = \arg\min_{v} I_{\rm filter}(v).$$

This provides a family of algorithms for updating the mean, differing depending upon how 
$\hat{C}_{j+1}$ is specified. In this section we will consider several choices for this specification, and 
hence several different algorithms. Notice that the minimization principle is very natural; it 
enforces a compromise between fitting the model prediction $\hat{m}_{j+1}$ and the data $y_{j+1}$. 
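
In the linear case (4.10) the compromise can be made explicit by differentiating the objective. The short calculation below is a sketch which assumes the weighted-norm convention $|v|_A^2 = v^T A^{-1} v$, which is how the norms are used in (4.6).

```latex
\begin{align*}
\nabla I_{\rm filter}(v)
  &= -H^T \Gamma^{-1} (y_{j+1} - H v) + \hat{C}_{j+1}^{-1} (v - \hat{m}_{j+1}) = 0 \\
\Longrightarrow\quad
\bigl(\hat{C}_{j+1}^{-1} + H^T \Gamma^{-1} H\bigr) v
  &= \hat{C}_{j+1}^{-1} \hat{m}_{j+1} + H^T \Gamma^{-1} y_{j+1}.
\end{align*}
% By (4.7) the matrix on the left-hand side is C_{j+1}^{-1}, so the stationarity
% condition is exactly (4.8) and the unique minimizer is v = m_{j+1}.
```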

For simplicity we consider the case where observations are linear and $h(v) = Hv$, leading 
to the update algorithm $m_j \mapsto m_{j+1}$ defined by 

$$\hat{m}_{j+1} = \Psi(m_j) + \xi_j, \tag{4.12a}$$
$$I_{\rm filter}(v) = \tfrac12 |y_{j+1} - Hv|_{\Gamma}^2 + \tfrac12 |v - \hat{m}_{j+1}|_{\hat{C}_{j+1}}^2, \tag{4.12b}$$
$$m_{j+1} = \arg\min_{v} I_{\rm filter}(v). \tag{4.12c}$$

This quadratic minimization problem is explicitly solvable and, by the arguments used in 
deriving Corollary 4.2, we deduce the following update formulae: 

$$m_{j+1} = (I - K_{j+1} H) \hat{m}_{j+1} + K_{j+1} y_{j+1}, \tag{4.13a}$$
$$K_{j+1} = \hat{C}_{j+1} H^T S_{j+1}^{-1}, \tag{4.13b}$$
$$S_{j+1} = H \hat{C}_{j+1} H^T + \Gamma. \tag{4.13c}$$

The next three subsections each correspond to algorithms derived in this way, namely by 
minimizing $I_{\rm filter}(v)$, but corresponding to different choices of the model covariance $\hat{C}_{j+1}$. We 
also note that in the first two of these subsections we choose $\xi_j = 0$ in equation (4.12a) so 
that the prediction is made by the noise-free dynamical model; however this is not a necessary 
choice and, whilst it is natural for the extended Kalman filter, for 3DVAR including random 
effects in (4.12a) is also reasonable in some settings. Likewise the ensemble Kalman filter 
can also be implemented with noise-free prediction models. 

We refer to these three algorithms collectively as approximate Gaussian filters. This 
is because they invoke a Gaussian approximation when updating the estimate of the signal 
via (4.12b). Specifically this update is the correct update for the mean if the assumption that 
$\mathbb{P}(v_{j+1}|Y_j) = N(\hat{m}_{j+1}, \hat{C}_{j+1})$ is invoked for the prediction step. In general the approximation 
implied by this assumption will not be a good one and this can invalidate the statistical 
accuracy of the resulting algorithms. However the resulting algorithms may still have desirable 
properties in terms of signal estimation; in subsection 4.4.2 we will demonstrate that this is 
indeed so. 

4.2.1. 3DVAR 


This algorithm is derived from (4.13) by simply fixing the model covariance $\hat{C}_{j+1} = \hat{C}$ for all 
j. Thus we obtain 

$$\hat{m}_{j+1} = \Psi(m_j), \tag{4.14a}$$
$$m_{j+1} = (I - KH) \hat{m}_{j+1} + K y_{j+1}, \tag{4.14b}$$
$$K = \hat{C} H^T S^{-1}, \quad S = H \hat{C} H^T + \Gamma. \tag{4.14c}$$

The nomenclature 3DVAR refers to the fact that the method is variational (it is based on the 
minimization principle underlying all of the approximate Gaussian methods), and it works 
sequentially at each fixed time j ; as such the minimization, when applied to practical physical 
problems, is over three spatial dimensions. This should be contrasted with 4DVAR which 
involves a minimization over all spatial dimensions, as well as time - four dimensions in all. 
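
The 3DVAR update (4.14) is short enough to write out in full for a scalar state and scalar data. The following is a minimal sketch in Python, not one of the programs accompanying the text; the model map, the noise level, the fixed covariance `c_hat` and the function name are all illustrative assumptions.

```python
import math
import random


def three_dvar_step(m, y, psi, H, c_hat, gamma2):
    """One 3DVAR update (4.14), scalar state and data, fixed covariance c_hat."""
    m_hat = psi(m)                        # noise-free prediction, (4.14a)
    s = H * c_hat * H + gamma2            # S in (4.14c)
    k = c_hat * H / s                     # constant gain K in (4.14c)
    return (1.0 - k * H) * m_hat + k * y  # analysis, (4.14b)


# Illustrative set-up (an assumption, not taken from the text):
# deterministic dynamics v -> alpha * sin(v), observed directly with small noise.
alpha, H, gamma = 2.5, 1.0, 0.1
psi = lambda v: alpha * math.sin(v)

rng = random.Random(1)
v, m = 1.0, 0.0                           # true initial state, initial estimate
errors = []
for _ in range(200):
    v = psi(v)                                   # truth
    y = H * v + gamma * rng.gauss(0.0, 1.0)      # data
    m = three_dvar_step(m, y, psi, H, c_hat=1.0, gamma2=gamma ** 2)
    errors.append(abs(m - v))

mean_late_error = sum(errors[100:]) / 100
```

With these values the gain is close to one, so the estimate leans heavily on the data and the error settles at roughly the size of the observational noise. Because the gain is fixed, no covariance needs to be propagated.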

We now describe two methodologies which generalize 3DVAR by employing model covariances 
which evolve from step j to step j + 1: the extended and ensemble Kalman filters. We 
present both methods in basic form but conclude the section with some discussion of methods 
widely used in practice to improve their practical performance. 


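4.2.2. Extended Kalman Filter 

The idea of the extended Kalman filter (ExKF) is to propagate covariances according to the 
linearization of (2.1), and propagate the mean, using (2.3). Thus we obtain, from modification 
of Corollary 4.2 and (4.4), (4.5), 

Prediction: 
$$\begin{aligned}
\hat{m}_{j+1} &= \Psi(m_j), \\
\hat{C}_{j+1} &= D\Psi(m_j)\, C_j\, D\Psi(m_j)^T + \Sigma.
\end{aligned}$$

Analysis: 
$$\begin{aligned}
S_{j+1} &= H \hat{C}_{j+1} H^T + \Gamma, \\
K_{j+1} &= \hat{C}_{j+1} H^T S_{j+1}^{-1}, \\
m_{j+1} &= (I - K_{j+1} H) \hat{m}_{j+1} + K_{j+1} y_{j+1}, \\
C_{j+1} &= (I - K_{j+1} H) \hat{C}_{j+1}.
\end{aligned}$$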

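4.2.3. Ensemble Kalman Filter 

The ensemble Kalman filter (EnKF) generalizes the idea of approximate Gaussian filters in 
a significant way: rather than using the minimization procedure (4.12) to update a single 
estimate of the mean, it is used to generate an ensemble of particles which all satisfy the 
model/data compromise inherent in the minimization; the mean and covariance used in the 
minimization are then estimated using this ensemble, thereby adding further coupling to the 
particles, in addition to that introduced by the data. 

The EnKF is executed in a variety of ways and we start by describing one of these, the 
perturbed observation EnKF: 

Prediction: 
$$\begin{aligned}
\hat{v}_{j+1}^{(n)} &= \Psi(v_j^{(n)}) + \xi_j^{(n)}, \quad n = 1, \dots, N, \\
\hat{m}_{j+1} &= \frac{1}{N} \sum_{n=1}^{N} \hat{v}_{j+1}^{(n)}, \\
\hat{C}_{j+1} &= \frac{1}{N-1} \sum_{n=1}^{N} \bigl(\hat{v}_{j+1}^{(n)} - \hat{m}_{j+1}\bigr) \bigl(\hat{v}_{j+1}^{(n)} - \hat{m}_{j+1}\bigr)^T.
\end{aligned}$$

Analysis: 
$$\begin{aligned}
S_{j+1} &= H \hat{C}_{j+1} H^T + \Gamma, \\
K_{j+1} &= \hat{C}_{j+1} H^T S_{j+1}^{-1}, \\
v_{j+1}^{(n)} &= (I - K_{j+1} H) \hat{v}_{j+1}^{(n)} + K_{j+1} y_{j+1}^{(n)}, \quad n = 1, \dots, N, \\
y_{j+1}^{(n)} &= y_{j+1} + \eta_{j+1}^{(n)}, \quad n = 1, \dots, N.
\end{aligned}$$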

Here $\eta_j^{(n)}$ are i.i.d. draws from $N(0, \Gamma)$ and $\xi_j^{(n)}$ are i.i.d. draws from $N(0, \Sigma)$. Perturbed 
observation refers to the fact that each particle sees an observation perturbed by an independent 
draw from $N(0, \Gamma)$. This procedure gives the Kalman Filter in the linear case in 
the limit of infinite ensemble. Even though the algorithm is motivated through our general 
approximate Gaussian filters framework, notice that the ensemble is not prescribed to be 
Gaussian. Indeed it evolves under the full nonlinear dynamics in the prediction step. This 
fact, together with the fact that covariance matrices are not propagated explicitly, other than 
through the empirical properties of the ensemble, has made the algorithm very appealing to 
practitioners. 
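
The claim that the perturbed observation EnKF recovers the Kalman filter for linear problems in the large ensemble limit can be illustrated numerically. The sketch below, in Python, performs one EnKF step for a scalar linear model with a large ensemble and compares the ensemble mean and variance with the exact Kalman update. The function name and all numerical values are assumptions made for the illustration, and the empirical covariance uses the 1/(N-1) normalization.

```python
import random
from statistics import mean, variance


def enkf_step(ensemble, y, psi, H, sigma, gamma, rng):
    """One perturbed-observation EnKF step for scalar state and scalar data."""
    # Prediction: push each particle through the full (noisy) dynamics.
    v_hat = [psi(v) + sigma * rng.gauss(0.0, 1.0) for v in ensemble]
    c_hat = variance(v_hat)               # empirical covariance, 1/(N-1) form
    # Analysis: Kalman gain built from the empirical covariance.
    s = H * c_hat * H + gamma ** 2
    k = c_hat * H / s
    # Each particle sees its own independently perturbed observation.
    return [(1.0 - k * H) * vh + k * (y + gamma * rng.gauss(0.0, 1.0))
            for vh in v_hat]


# Linear test problem with illustrative values (assumptions, not from the text).
M, H, sigma, gamma = 0.9, 1.0, 0.3, 0.5
rng = random.Random(2)
ensemble = [0.5 + 2.0 ** 0.5 * rng.gauss(0.0, 1.0) for _ in range(20000)]  # N(0.5, 2)
analysis = enkf_step(ensemble, 1.3, lambda v: M * v, H, sigma, gamma, rng)

# Exact Kalman filter answer for the same step, from Corollary 4.2.
c_hat = M * 2.0 * M + sigma ** 2
k = c_hat * H / (H * c_hat * H + gamma ** 2)
kalman_mean = M * 0.5 + k * (1.3 - H * M * 0.5)
kalman_cov = (1.0 - k * H) * c_hat
```

For nonlinear `psi` the same function runs unchanged, but the comparison with a Gaussian reference is then no longer exact.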

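Another way to motivate the preceding algorithm is to introduce the family of cost functions 

$$I_{{\rm filter},n}(v) := \tfrac12 |y_{j+1}^{(n)} - Hv|_{\Gamma}^2 + \tfrac12 |v - \hat{v}_{j+1}^{(n)}|_{\hat{C}_{j+1}}^2. \tag{4.15}$$

The analysis step proceeds to determine the ensemble $\{v_{j+1}^{(n)}\}_{n=1}^{N}$ by minimizing $I_{{\rm filter},n}$ with 
$n = 1, \dots, N$. The set $\{\hat{v}_{j+1}^{(n)}\}_{n=1}^{N}$ is found from running the prediction step using the fully 
nonlinear dynamics. These minimization problems are coupled through $\hat{C}_{j+1}$, which depends 
on the entire set of $\{\hat{v}_{j+1}^{(n)}\}_{n=1}^{N}$. The algorithm thus provides update rules of the form 

$$\{v_j^{(n)}\}_{n=1}^{N} \mapsto \{\hat{v}_{j+1}^{(n)}\}_{n=1}^{N}, \qquad \{\hat{v}_{j+1}^{(n)}\}_{n=1}^{N} \mapsto \{v_{j+1}^{(n)}\}_{n=1}^{N}, \tag{4.16}$$

defining approximations of the prediction and analysis steps respectively. 

It is then natural to think of the algorithm making the approximations 

$$\mu_j \approx \mu_j^N = \frac{1}{N} \sum_{n=1}^{N} \delta_{v_j^{(n)}}, \qquad \hat{\mu}_{j+1} \approx \hat{\mu}_{j+1}^N = \frac{1}{N} \sum_{n=1}^{N} \delta_{\hat{v}_{j+1}^{(n)}}. \tag{4.17}$$

Thus we have a form of Monte Carlo approximation of the distribution of interest. However, 
except for linear problems, the approximations given do not, in general, converge to the true 
distributions $\mu_j$ and $\hat{\mu}_j$ as $N \to \infty$. 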

4.2.4. Square Root Ensemble Kalman Filters 

We now describe another popular variant of the EnKF. The idea of this variant is to define 
the analysis step in such a way that an ensemble of particles is produced whose empirical 
covariance exactly satisfies the Kalman identity 

$$C_{j+1} = (I - K_{j+1} H) \hat{C}_{j+1} \tag{4.18}$$

which relates the covariances in the analysis step to those in the prediction step. This is done 
by mapping the mean of the predicted ensemble according to the standard Kalman update, 
and introducing a linear deterministic transformation of the differences between the particle 
positions and their mean to enforce (4.18). Doing so eliminates a sampling error inherent in 
the perturbed observation approach. The resulting algorithm has the following form: 

Prediction: 
$$\begin{aligned}
\hat{v}_{j+1}^{(n)} &= \Psi(v_j^{(n)}) + \xi_j^{(n)}, \quad n = 1, \dots, N, \\
\hat{m}_{j+1} &= \frac{1}{N} \sum_{n=1}^{N} \hat{v}_{j+1}^{(n)}, \\
\hat{C}_{j+1} &= \frac{1}{N-1} \sum_{n=1}^{N} \bigl(\hat{v}_{j+1}^{(n)} - \hat{m}_{j+1}\bigr) \bigl(\hat{v}_{j+1}^{(n)} - \hat{m}_{j+1}\bigr)^T.
\end{aligned}$$

Analysis: 
$$\begin{aligned}
S_{j+1} &= H \hat{C}_{j+1} H^T + \Gamma, \\
K_{j+1} &= \hat{C}_{j+1} H^T S_{j+1}^{-1}, \\
m_{j+1} &= (I - K_{j+1} H) \hat{m}_{j+1} + K_{j+1} y_{j+1}, \\
v_{j+1}^{(n)} &= m_{j+1} + \zeta_{j+1}^{(n)}.
\end{aligned}$$

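Here the $\{\zeta_{j+1}^{(n)}\}_{n=1}^{N}$ are designed to have sample covariance $C_{j+1} = (I - K_{j+1} H) \hat{C}_{j+1}$. There 
are several ways to do this and we now describe one of them, referred to as the ensemble 
transform Kalman filter (ETKF). 

If we define 

$$\hat{X}_{j+1} = \frac{1}{\sqrt{N-1}} \Bigl[\hat{v}_{j+1}^{(1)} - \hat{m}_{j+1}, \dots, \hat{v}_{j+1}^{(N)} - \hat{m}_{j+1}\Bigr]$$

then $\hat{C}_{j+1} = \hat{X}_{j+1} \hat{X}_{j+1}^T$. We now seek a transformation $T_{j+1}$ so that, if $X_{j+1} = \hat{X}_{j+1} T_{j+1}^{1/2}$, 
then 

$$C_{j+1} := X_{j+1} X_{j+1}^T = (I - K_{j+1} H) \hat{C}_{j+1}. \tag{4.19}$$

Note that the $X_{j+1}$ (resp. the $\hat{X}_{j+1}$) correspond to Cholesky factors of the matrices $C_{j+1}$ 
(resp. $\hat{C}_{j+1}$). We may now define the $\{\zeta_{j+1}^{(n)}\}_{n=1}^{N}$ by 

$$X_{j+1} = \frac{1}{\sqrt{N-1}} \Bigl[\zeta_{j+1}^{(1)}, \dots, \zeta_{j+1}^{(N)}\Bigr].$$

We now demonstrate how to find an appropriate transformation $T_{j+1}$. We assume that $T_{j+1}$ is 
symmetric and positive-definite and the standard matrix square-root is employed. Choosing 

$$T_{j+1} = \Bigl[I + (H \hat{X}_{j+1})^T \Gamma^{-1} (H \hat{X}_{j+1})\Bigr]^{-1}$$

we see that 

$$\begin{aligned}
X_{j+1} X_{j+1}^T &= \hat{X}_{j+1} T_{j+1} \hat{X}_{j+1}^T \\
 &= \hat{X}_{j+1} \Bigl[I + (H \hat{X}_{j+1})^T \Gamma^{-1} (H \hat{X}_{j+1})\Bigr]^{-1} \hat{X}_{j+1}^T \\
 &= \hat{X}_{j+1} \Bigl\{I - (H \hat{X}_{j+1})^T \bigl[(H \hat{X}_{j+1})(H \hat{X}_{j+1})^T + \Gamma\bigr]^{-1} (H \hat{X}_{j+1})\Bigr\} \hat{X}_{j+1}^T \\
 &= (I - K_{j+1} H) \hat{C}_{j+1}
\end{aligned}$$

as required, where the transformation between the second and third lines is justified by 
Lemma 4.4. It is important to ensure that $\mathbf{1}$, the vector of all ones, is an eigenvector of the 
transformation $T_{j+1}$, and hence of $T_{j+1}^{1/2}$, so that the mean of the ensemble is preserved. This 
is guaranteed by $T_{j+1}$ as defined. 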
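
In the special case of a scalar state and scalar data the transformation takes a particularly simple form: the predicted anomalies are all multiplied by the single factor $\sqrt{1 - K_{j+1} H} = \sqrt{\Gamma / S_{j+1}}$. The Python sketch below uses this scalar reduction, which is our own simplification rather than a formula stated in the text, to show that the analysis ensemble then has exactly the Kalman mean and a sample covariance satisfying (4.18). The function name and the ensemble values are hypothetical.

```python
import math
from statistics import mean, variance


def etkf_step_scalar(v_hat, y, H, gamma):
    """Square-root (ETKF-type) analysis step for scalar state and scalar data.

    v_hat is the predicted ensemble. In the scalar case the transform acts
    on the anomalies as multiplication by one common factor.
    """
    m_hat = mean(v_hat)
    anomalies = [v - m_hat for v in v_hat]
    c_hat = variance(v_hat)                        # 1/(N-1) normalization
    s = H * c_hat * H + gamma ** 2
    k = c_hat * H / s
    m_new = (1.0 - k * H) * m_hat + k * y          # standard Kalman mean update
    scale = math.sqrt(gamma ** 2 / s)              # equals sqrt(1 - K H) here
    return [m_new + scale * a for a in anomalies]  # v = m + zeta


v_hat = [0.2, 1.1, -0.4, 0.9, 1.7]                 # illustrative predicted ensemble
analysis = etkf_step_scalar(v_hat, y=1.0, H=1.0, gamma=0.5)
```

Unlike the perturbed observation approach, no random numbers are drawn in the analysis step, so the covariance identity holds exactly rather than only on average.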

4.3 The Particle Filter 


In this section we introduce an important class of filtering methods known as particle filters. 
In contrast to the filters introduced in the preceding section, the particle filter can be proved 
to reproduce the true posterior filtering distribution in the large particle limit and, as such, 
has a privileged place amongst all the filters introduced in this book. We will describe the 
method in its basic form, the bootstrap filter, and then give a proof of convergence. It is 
important to appreciate that the form of particle filter introduced here is far from state-of-the- 
art and that far more sophisticated versions are used in practical applications. Nonetheless, 
despite this sophistication, particle filters do not perform well in applications such as those 
arising in geophysical applications of data assimilation, because the data in those applications 
places very strong constraints on particle locations, making efficient algorithms very hard to 
design. It is for this reason that we have introduced particle filters after the approximate 
Gaussian filters introduced in the preceding section. The filters in the preceding section 
tend to be more robust to data specifications. However they do all rely on the invocation 
of ad hoc Gaussian assumptions in their derivation and hence do not provably produce the 
correct posterior filtering distribution, notwithstanding their ability, in partially observed 
small noise scenarios, to correctly identify the signal itself, as in Theorem 4.10. Because 
it can provably reproduce the correct filtering distribution, the particle filter thus plays an 
important role, conceptually, even though it is not, in current form, a practical algorithm 
in geophysical applications. With further improvements it may, in time, form the basis for 
practical algorithms in geophysical applications. 


4.3.1. The Basic Approximation Scheme 

All probability measures which possess a density with respect to Lebesgue measure can be 
approximated by a finite convex combination of Dirac probability measures; an example of 
this is the Monte Carlo sampling idea that we described at the start of Chapter 3, which also 
underlies the ensemble Kalman filter of subsection 4.2.3. In practice the idea of approximation 
by a convex combination of probability measures requires the determination of the locations 
and weights associated with these Dirac measures. Particle filters are sequential algorithms 
which use this idea to approximate the true filtering distribution $\mathbb{P}(v_j|Y_j)$. 

Basic Monte Carlo, as described at the start of Chapter 3, and the ensemble Kalman filter, 
as in (4.17), correspond to approximation by equal weights. Recall $\mu_j$, the probability measure 
on $\mathbb{R}^n$ corresponding to the density $\mathbb{P}(v_j|Y_j)$, and $\hat{\mu}_{j+1}$, the probability measure on $\mathbb{R}^n$ 
corresponding to the density $\mathbb{P}(v_{j+1}|Y_j)$. The basic form of the particle filter proceeds by 
allowing the weights to vary and by finding N-particle Dirac measure approximations of the form 

$$\mu_j \approx \mu_j^N := \sum_{n=1}^{N} w_j^{(n)} \delta_{v_j^{(n)}}, \qquad \hat{\mu}_{j+1} \approx \hat{\mu}_{j+1}^N := \sum_{n=1}^{N} \hat{w}_{j+1}^{(n)} \delta_{\hat{v}_{j+1}^{(n)}}. \tag{4.20}$$

The weights must sum to one. The approximate distribution $\mu_j^N$ is completely defined by 
particle positions $v_j^{(n)}$ and weights $w_j^{(n)}$, and the approximate distribution $\hat{\mu}_{j+1}^N$ is completely 
defined by particle positions $\hat{v}_{j+1}^{(n)}$ and weights $\hat{w}_{j+1}^{(n)}$. Thus the objective of the method is to 
find update rules 

$$\{v_j^{(n)}, w_j^{(n)}\}_{n=1}^{N} \mapsto \{\hat{v}_{j+1}^{(n)}, \hat{w}_{j+1}^{(n)}\}_{n=1}^{N}, \qquad \{\hat{v}_{j+1}^{(n)}, \hat{w}_{j+1}^{(n)}\}_{n=1}^{N} \mapsto \{v_{j+1}^{(n)}, w_{j+1}^{(n)}\}_{n=1}^{N}, \tag{4.21}$$

defining the prediction and analysis approximations respectively; compare this with (4.16) 
for the EnKF where the particle weights are uniform and only the positions are updated. 



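Defining the updates for the particle filter may be achieved by an application of sampling, 
for the prediction step, and of Bayesian probability, for the analysis step. 

Recall the prediction and analysis formulae from (2.30) and (2.31), which can be summarized 
as 

$$\mathbb{P}(v_{j+1}|Y_j) = \int \mathbb{P}(v_{j+1}|v_j)\, \mathbb{P}(v_j|Y_j)\, dv_j, \tag{4.22a}$$
$$\mathbb{P}(v_{j+1}|Y_{j+1}) = \frac{\mathbb{P}(y_{j+1}|v_{j+1})\, \mathbb{P}(v_{j+1}|Y_j)}{\mathbb{P}(y_{j+1}|Y_j)}. \tag{4.22b}$$

We may rewrite (4.22) as 

$$\hat{\mu}_{j+1}(\cdot) = (P \mu_j)(\cdot) := \int \mathbb{P}(\cdot|v_j)\, \mu_j(dv_j), \tag{4.23a}$$
$$\frac{d\mu_{j+1}}{d\hat{\mu}_{j+1}}(v_{j+1}) = \frac{\mathbb{P}(y_{j+1}|v_{j+1})}{\mathbb{P}(y_{j+1}|Y_j)}. \tag{4.23b}$$

Writing the update formulae this way is important for us because they then make sense in 
the absence of Lebesgue densities; in particular we can use them in situations where Dirac 
masses appear, as they do in our approximate probability measures. The formula (4.23b) 
for the density or Radon-Nikodym derivative of $\mu_{j+1}$ with respect to that of $\hat{\mu}_{j+1}$ has a 
straightforward interpretation: the righthand-side quantifies how to reweight expectations 
under $\hat{\mu}_{j+1}$ so that they become expectations under $\mu_{j+1}$. To be concrete we may write 

$$\mathbb{E}^{\mu_{j+1}} \varphi(v_{j+1}) = \mathbb{E}^{\hat{\mu}_{j+1}} \Bigl(\frac{d\mu_{j+1}}{d\hat{\mu}_{j+1}}(v_{j+1})\, \varphi(v_{j+1})\Bigr).$$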

4.3.2. Sequential Importance Resampling 


The simplest particle filter, which is based on sequential importance resampling, is now 
described. We start by assuming that we have an approximation $\mu_j^N$ given by (4.20) and explain 
how to evolve the weights $\{v_j^{(n)}, w_j^{(n)}\}_{n=1}^{N}$ into $\{v_{j+1}^{(n)}, w_{j+1}^{(n)}\}_{n=1}^{N}$, via 
$\{\hat{v}_{j+1}^{(n)}, \hat{w}_{j+1}^{(n)}\}_{n=1}^{N}$, as in (4.21). 

Prediction In this step we approximate the prediction phase of the Markov chain. To do 
this we simply draw $\hat{v}_{j+1}^{(n)}$ from the kernel p of the Markov chain (2.1) started from $v_j^{(n)}$. 
Thus the relevant kernel is $p(v_j, v_{j+1}) = \mathbb{P}(v_{j+1}|v_j)$. We then have $\hat{v}_{j+1}^{(n)} \sim p(v_j^{(n)}, \cdot)$. We leave 
the weights of the approximation unchanged so that $\hat{w}_{j+1}^{(n)} = w_j^{(n)}$. From these new particles 
and (in fact unchanged) weights we have the particle approximation 

$$\hat{\mu}_{j+1}^N = \sum_{n=1}^{N} w_j^{(n)} \delta_{\hat{v}_{j+1}^{(n)}}. \tag{4.24}$$

Analysis In this step we approximate the incorporation of data via Bayes' formula. Define 
$g_j(v)$ by 

$$g_j(v_{j+1}) \propto \mathbb{P}(y_{j+1}|v_{j+1}), \tag{4.25}$$

where the constant of proportionality is, for example, the normalization for the Gaussian, 
and is hence independent of both $y_{j+1}$ and $v_{j+1}$. We now apply Bayes' formula in the form 
(4.23b). Thus we obtain 

$$\mu_{j+1}^N = \sum_{n=1}^{N} w_{j+1}^{(n)} \delta_{\hat{v}_{j+1}^{(n)}} \tag{4.26}$$

where 

$$w_{j+1}^{(n)} = \tilde{w}_{j+1}^{(n)} \Big/ \Bigl(\sum_{n=1}^{N} \tilde{w}_{j+1}^{(n)}\Bigr), \qquad \tilde{w}_{j+1}^{(n)} = g_j\bigl(\hat{v}_{j+1}^{(n)}\bigr)\, w_j^{(n)}. \tag{4.27}$$

The first equation in the preceding is required for normalization. Thus in this step we do not 
change the particle positions, but we reweight them. 

Resampling The algorithm as described is deficient in two regards, both of which can be 
dealt with by introducing a re-sampling step into the algorithm. Firstly, the initial measure 
$\mu_0$ for the true filtering distribution will not typically be made up of a combination of Dirac 
measures. Secondly, the method can perform poorly if one of the particle weights approaches 
1 (and then all others approach 0). The effect of the first can be dealt with by sampling 
the initial measure and approximating it by an equally weighted (by $N^{-1}$) sum of Dirac 
measures at the samples. The second can be ameliorated by drawing a set of N particles 
from the measure (4.26) and assigning weight $N^{-1}$ to each; this has the effect of multiplying 
particles with high weights and killing particles with low weights. 

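Putting together the three preceding steps leads to the following algorithm; for notational 
convenience we use $Y_0$ to denote the empty vector (no observations at the start): 

1. Set j = 0 and $\mu_0^N(dv_0) = \mu_0(dv_0)$. 

2. Draw $v_j^{(n)} \sim \mu_j^N$, $n = 1, \dots, N$. 

3. Set $w_j^{(n)} = 1/N$, $n = 1, \dots, N$; redefine $\mu_j^N := \sum_{n=1}^{N} w_j^{(n)} \delta_{v_j^{(n)}}$. 

4. Draw $\hat{v}_{j+1}^{(n)} \sim p(v_j^{(n)}|\cdot)$. 

5. Define $w_{j+1}^{(n)}$ by (4.27) and $\mu_{j+1}^N := \sum_{n=1}^{N} w_{j+1}^{(n)} \delta_{\hat{v}_{j+1}^{(n)}}$. 

6. $j + 1 \to j$. 

7. Go to step 2. 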

This algorithm is conceptually intuitive, proposing that each particle moves according to 
the dynamics of the underlying model itself, and is then re-weighted according to the likelihood 
of the proposed particle, i.e. according to the data. This sequential importance resampling 
filter is also sometimes termed the bootstrap filter. We will comment on important 
improvements to this basic algorithm in the following section and in the bibliographic notes. 
Here we prove convergence of this basic method, as the number of particles goes to infinity, 
thereby demonstrating the potential power of the bootstrap filter and more sophisticated 
variants on it. 
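
The seven steps above can be sketched in code. The following is a minimal Python illustration under stated assumptions, not the authors' implementation: the state is scalar, the model noise is additive Gaussian with standard deviation `sigma`, and the state is observed directly with Gaussian noise of standard deviation `gamma`. The function name `bootstrap_filter` and its arguments are hypothetical.

```python
import math
import random


def bootstrap_filter(ys, psi, sigma, gamma, m0, c0, n_particles, rng):
    """Bootstrap (SIR) filter for v_{j+1} = psi(v_j) + N(0, sigma^2),
    y_{j+1} = v_{j+1} + N(0, gamma^2). Returns the filter mean at each step."""
    # Steps 1-3: sample the initial Gaussian; weights are implicitly 1/N.
    particles = [rng.gauss(m0, math.sqrt(c0)) for _ in range(n_particles)]
    means = []
    for y in ys:
        # Step 4: propose each particle from the model dynamics.
        proposed = [psi(v) + rng.gauss(0.0, sigma) for v in particles]
        # Step 5: reweight by the likelihood, as in (4.27). The common factor
        # 1/N cancels on normalization; subtracting the smallest misfit
        # guards against every weight underflowing to zero.
        misfit = [0.5 * ((y - v) / gamma) ** 2 for v in proposed]
        shift = min(misfit)
        w_hat = [math.exp(-(m - shift)) for m in misfit]
        total = sum(w_hat)
        weights = [w / total for w in w_hat]
        means.append(sum(w * v for w, v in zip(weights, proposed)))
        # Step 2 of the next pass: resample N equally weighted particles.
        particles = rng.choices(proposed, weights=weights, k=n_particles)
    return means
```

Resampling at every step, as here, matches the algorithm as stated; it is the simplest choice rather than the most economical one.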

Recall that, by (2.32), the true filtering distribution simply satisfies the iteration

$$\mu_{j+1} = L_j P \mu_j, \qquad \mu_0 = N(m_0, C_0), \tag{4.28}$$

where $P$ corresponds to moving a point currently at $v$ according to the Markov kernel $p(\cdot|v)$ describing the dynamics given by (2.1), and $L_j$ denotes the application of Bayes' formula with likelihood proportional to $g_j(\cdot)$ given by (4.25). Recall also the sampling operator $S^N$ defined earlier. It is then instructive to write the particle filtering algorithm which approximates (4.28) in the following form:

$$\mu^N_{j+1} = L_j S^N P \mu^N_j, \qquad \mu^N_0 = \mu_0. \tag{4.29}$$

There is a slight trickery here in writing application of the sampling $S^N$ after application of $P$, but some reflection shows that this is well-justified: applying $P$ followed by $S^N$ can be shown, by first conditioning on the initial point and sampling with respect to $P$, and then sampling over the distribution of the initial point, to be the algorithm as defined.

Comparison of (4.28) and (4.29) shows that analyzing the particle filter requires estimation of the error induced by application of $S^N$ (the resampling error) together with estimation of the rate of accumulation of this error in time under the application of $L_j$ and $P$. We now build the tools to allow us to do this. The operators $L_j$, $P$ and $S^N$ map the space $\mathcal{P}(\mathbb{R}^n)$ of probability measures on $\mathbb{R}^n$ into itself according to the following:

$$(L_j\mu)(dv) = \frac{g_j(v)\,\mu(dv)}{\int_{\mathbb{R}^n} g_j(v)\,\mu(dv)}, \tag{4.30a}$$

$$(P\mu)(dv) = \int_{\mathbb{R}^n} p(v', dv)\,\mu(dv'), \tag{4.30b}$$

$$(S^N\mu)(dv) = \frac{1}{N}\sum_{n=1}^{N}\delta_{v^{(n)}}(dv), \qquad v^{(n)} \sim \mu \ \text{i.i.d.} \tag{4.30c}$$

Notice that both $L_j$ and $P$ are deterministic maps, whilst $S^N$ is random.

Let $\mu = \mu_\omega$ denote, for each $\omega$, an element of $\mathcal{P}(\mathbb{R}^n)$. If we then assume that $\omega$ is a random variable describing the randomness required to define the sampling operator $S^N$, and let $\mathbb{E}^\omega$ denote expectation over $\omega$, then we may define a "root mean square" distance $d(\cdot,\cdot)$ between two random probability measures $\mu_\omega$, $\nu_\omega$ as follows:

$$d(\mu,\nu) = \sup_{|f|_\infty \le 1}\sqrt{\mathbb{E}^\omega|\mu(f) - \nu(f)|^2}.$$

Here we have used the convention that $\mu(f) = \int_{\mathbb{R}^n} f(v)\,\mu(dv)$ for measurable $f : \mathbb{R}^n \to \mathbb{R}$, and similarly for $\nu$. Furthermore

$$|f|_\infty = \sup_u |f(u)|.$$

This distance does indeed generate a metric and, in particular, satisfies the triangle inequality. Note also that, in the absence of randomness within the measures, the metric satisfies $d(\mu,\nu) = 2d_{TV}(\mu,\nu)$, by (1.12): that is, it reduces to the total variation metric. In our context the randomness within the probability measures comes from the sampling operator $S^N$ used to define the numerical approximation.


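Theorem 4.5 We assume in the following that there exists $\kappa \in (0,1]$ such that for all $v \in \mathbb{R}^n$ and $j \in \mathbb{N}$

$$\kappa \le g_j(v) \le \kappa^{-1}.$$

Then

$$d(\mu^N_J, \mu_J) \le \sum_{j=1}^{J}(2\kappa^{-2})^j\,\frac{1}{\sqrt{N}}.$$

Proof The desired result is proved below in a straightforward way from the following three facts, whose proof we postpone to three lemmas at the end of the section:

$$\sup_{\nu\in\mathcal{P}(\mathbb{R}^n)} d(S^N\nu,\nu) \le \frac{1}{\sqrt{N}}, \tag{4.31a}$$

$$d(P\nu, P\mu) \le d(\nu,\mu), \tag{4.31b}$$

$$d(L_j\nu, L_j\mu) \le 2\kappa^{-2}d(\nu,\mu). \tag{4.31c}$$

By the triangle inequality we have, for $\nu^N_j = P\mu^N_j$,

$$\begin{aligned}
d(\mu^N_{j+1},\mu_{j+1}) &= d(L_jS^NP\mu^N_j,\, L_jP\mu_j)\\
&\le d(L_jP\mu^N_j,\, L_jP\mu_j) + d(L_jS^NP\mu^N_j,\, L_jP\mu^N_j)\\
&\le 2\kappa^{-2}\Big(d(\mu^N_j,\mu_j) + d(S^N\nu^N_j,\nu^N_j)\Big)\\
&\le 2\kappa^{-2}\Big(d(\mu^N_j,\mu_j) + \frac{1}{\sqrt{N}}\Big).
\end{aligned}$$

Iterating, after noting that $\mu^N_0 = \mu_0$, gives the desired result. □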
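
To get a feel for the size of the bound in Theorem 4.5 it can simply be evaluated. The sketch below does only that; the helper names are hypothetical, and nothing here indicates how sharp the bound is in practice.

```python
import math


def bootstrap_error_bound(kappa, n_steps, n_particles):
    """Right-hand side of Theorem 4.5: sum_{j=1}^{J} (2 kappa^-2)^j / sqrt(N)."""
    if not 0.0 < kappa <= 1.0:
        raise ValueError("kappa must lie in (0, 1]")
    rate = 2.0 / kappa**2
    return sum(rate**j for j in range(1, n_steps + 1)) / math.sqrt(n_particles)


def particles_needed(kappa, n_steps, tol):
    """Smallest N for which the bound of Theorem 4.5 is at most tol
    (up to floating-point rounding)."""
    unit_bound = bootstrap_error_bound(kappa, n_steps, 1)
    return math.ceil((unit_bound / tol) ** 2)
```

Because $2\kappa^{-2} \ge 2$, the value returned by `particles_needed` grows geometrically with the number of steps, which is the point made in the first item of the remark that follows.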

Remark 4.6 This important theorem shows that the particle filter reproduces the true filtering distribution, in the large particle limit. We make some comments about this.

• This theorem shows that, at any fixed discrete time $j$, the filtering distribution $\mu_j$ is well-approximated by the bootstrap filtering distribution $\mu^N_j$ in the sense that, as the number of particles $N \to \infty$, the approximating measure converges to the true measure. However, since $\kappa \le 1$, so that $2\kappa^{-2} \ge 2$, the number of particles required to decrease the upper bound on the error beneath a specified tolerance grows with $J$.

• If the likelihoods have a small lower bound then the constant in the convergence proof may be prohibitively large, requiring an enormous number of particles to obtain a small error. This is similar to the earlier discussion concerning the Independence Dynamics Sampler, where we showed that large values in the potential $\Phi$ lead to slow convergence of the Markov chain, and the resultant need for a large number of samples.

• In fact in many applications the likelihoods $g_j$ may not be bounded from above or below, uniformly in $j$, and more refined analysis is required. However, if the Markov kernel $P$ is ergodic then it is possible to obtain bounds in which the error constant arising in the analysis has milder growth with respect to $J$.

• Considering the case of deterministic dynamics shows just how difficult it may be to make the theorem applicable in practice: if the dynamics is deterministic then the original set of samples from $\mu_0$, $\{v^{(n)}_0\}_{n=1}^{N}$, give rise to a set of particles $v^{(n)}_j = \Psi^{(j)}(v^{(n)}_0)$; in other words the particle positions are unaffected by the data. This is clearly a highly undesirable situation, in general, since there is no reason at all why the pushforward under the dynamics of the initial measure $\mu_0$ should have substantial overlap with the filtering distribution for a given fixed data set. Indeed for chaotic dynamical systems one would expect that it does not, as the pushforward measure will be spread over the global attractor, whilst the data will, at fixed time, correspond to a single point on the attractor. This example motivates the improved proposals of the next section.


Before describing improvements to the basic particle filter, we prove the three lemmas 
underlying the convergence proof. 

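Lemma 4.7 The sampling operator satisfies

$$\sup_{\nu\in\mathcal{P}(\mathbb{R}^n)} d(S^N\nu,\nu) \le \frac{1}{\sqrt{N}}.$$

Proof Let $\nu$ be an element of $\mathcal{P}(\mathbb{R}^n)$ and $\{v^{(n)}\}_{n=1}^{N}$ be i.i.d. with $v^{(1)} \sim \nu$. In this proof the randomness in the measure $S^N$ arises from these samples $\{v^{(n)}\}_{n=1}^{N}$ and expectation over this randomness is denoted $\mathbb{E}$. Then

$$S^N\nu(f) = \frac{1}{N}\sum_{n=1}^{N} f(v^{(n)})$$

and, defining $\bar f = f - \nu(f)$, we deduce that

$$S^N\nu(f) - \nu(f) = \frac{1}{N}\sum_{n=1}^{N} \bar f(v^{(n)}).$$

It is straightforward to see that

$$\mathbb{E}\,\bar f(v^{(n)})\,\bar f(v^{(l)}) = \delta_{nl}\,\mathbb{E}\,|\bar f(v^{(n)})|^2.$$

Furthermore, for $|f|_\infty \le 1$,

$$\mathbb{E}\,|\bar f(v^{(1)})|^2 = \mathbb{E}\,|f(v^{(1)})|^2 - |\mathbb{E} f(v^{(1)})|^2 \le 1.$$

It follows that, for $|f|_\infty \le 1$,

$$\mathbb{E}\,|\nu(f) - S^N\nu(f)|^2 = \frac{1}{N^2}\sum_{n=1}^{N}\mathbb{E}\,|\bar f(v^{(n)})|^2 \le \frac{1}{N}.$$

Since the result is independent of $\nu$ we may take the supremum over all probability measures
and obtain the desired result. □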
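
The $N^{-1/2}$ rate in Lemma 4.7 is easy to observe numerically. The sketch below is an illustration under stated assumptions: it fixes $\nu = N(0,1)$ and the single test function $f(v) = \cos(v)$, for which $\nu(f) = e^{-1/2}$, rather than taking the supremum over all $f$ and $\nu$. The function name is hypothetical.

```python
import math
import random


def sampling_rms_error(n_samples, n_trials, rng):
    """Monte Carlo estimate of sqrt(E|S^N nu(f) - nu(f)|^2) for nu = N(0,1)
    and the test function f(v) = cos(v), which satisfies |f|_inf <= 1."""
    exact = math.exp(-0.5)  # E[cos(v)] for v ~ N(0,1)
    sq_err = 0.0
    for _ in range(n_trials):
        # One realization of the empirical measure S^N nu applied to f.
        draws = (math.cos(rng.gauss(0.0, 1.0)) for _ in range(n_samples))
        estimate = sum(draws) / n_samples
        sq_err += (estimate - exact) ** 2
    return math.sqrt(sq_err / n_trials)
```

For this particular $f$ the variance of $f(v)$ is about 0.2, so the estimated error should sit comfortably below the bound $1/\sqrt{N}$ while decaying at the same rate.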

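Lemma 4.8 Since $P$ is a Markov kernel we have

$$d(P\nu, P\nu') \le d(\nu,\nu').$$

Proof Define

$$q(v') = \int_{\mathbb{R}^n} p(v',v)\,f(v)\,dv = \mathbb{E}\big(f(v_1)\,|\,v_0 = v'\big),$$

that is, the expected value of $f$ under one step of the Markov chain given by (2.1), started from $v'$. Clearly, for $|f|_\infty \le 1$,

$$|q(v')| \le \int_{\mathbb{R}^n} p(v',dv)\,|f(v)| \le \int_{\mathbb{R}^n} p(v',dv) = 1.$$

Thus

$$|q|_\infty = \sup_v |q(v)| \le 1.$$

Note that

$$\begin{aligned}
\nu(q) = \mathbb{E}\big(f(v_1)\,|\,v_0 \sim \nu\big) &= \int_{\mathbb{R}^n}\Big(\int_{\mathbb{R}^n} p(v',v)\,f(v)\,dv\Big)\nu(dv')\\
&= \int_{\mathbb{R}^n}\Big(\int_{\mathbb{R}^n} p(v',v)\,\nu(dv')\Big)f(v)\,dv = P\nu(f).
\end{aligned}$$

Thus $P\nu(f) = \nu(q)$ and it follows that

$$|P\nu(f) - P\nu'(f)| = |\nu(q) - \nu'(q)|.$$

Thus

$$\begin{aligned}
d(P\nu,P\nu') &= \sup_{|f|_\infty\le 1}\Big(\mathbb{E}^\omega|P\nu(f) - P\nu'(f)|^2\Big)^{1/2}\\
&\le \sup_{|q|_\infty\le 1}\Big(\mathbb{E}^\omega|\nu(q) - \nu'(q)|^2\Big)^{1/2}\\
&= d(\nu,\nu')
\end{aligned}$$

as required. □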


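Lemma 4.9 Under the Assumptions of Theorem 4.5 we have

$$d(L_j\nu, L_j\mu) \le 2\kappa^{-2}d(\nu,\mu).$$

Proof Notice that for $|f|_\infty < \infty$ we can rewrite

$$\begin{aligned}
(L_j\nu)(f) - (L_j\mu)(f) &= \frac{\nu(fg_j)}{\nu(g_j)} - \frac{\mu(fg_j)}{\mu(g_j)} && \text{(4.32a)}\\
&= \frac{\nu(fg_j)}{\nu(g_j)} - \frac{\mu(fg_j)}{\nu(g_j)} + \frac{\mu(fg_j)}{\nu(g_j)} - \frac{\mu(fg_j)}{\mu(g_j)} && \text{(4.32b)}\\
&= \frac{\kappa^{-1}}{\nu(g_j)}\big[\nu(\kappa fg_j) - \mu(\kappa fg_j)\big] + \frac{\mu(fg_j)}{\mu(g_j)}\,\frac{\kappa^{-1}}{\nu(g_j)}\big[\mu(\kappa g_j) - \nu(\kappa g_j)\big]. && \text{(4.32c)}
\end{aligned}$$

Now notice that $\nu(g_j)^{-1} \le \kappa^{-1}$ and that, for $|f|_\infty \le 1$, $|\mu(fg_j)/\mu(g_j)| \le 1$ since the expression corresponds to an expectation with respect to the measure found from $\mu$ by reweighting with likelihood proportional to $g_j$. Thus

$$|(L_j\nu)(f) - (L_j\mu)(f)| \le \kappa^{-2}\,|\nu(\kappa fg_j) - \mu(\kappa fg_j)| + \kappa^{-2}\,|\nu(\kappa g_j) - \mu(\kappa g_j)|.$$

Since $|\kappa g_j|_\infty \le 1$ it follows that $|\kappa fg_j|_\infty \le 1$ and hence that

$$\mathbb{E}^\omega|(L_j\nu)(f) - (L_j\mu)(f)|^2 \le 4\kappa^{-4}\sup_{|h|_\infty\le 1}\mathbb{E}^\omega|\nu(h) - \mu(h)|^2.$$

The desired result follows. □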

4.3.3. Improved Proposals 

In the particle filter described in the previous section we propose according to the underlying unobserved dynamics, and then apply Bayes' formula to incorporate the data. The final point in Remark 4.6 demonstrates that this may result in a very poor set of particles with which to approximate the filtering distribution. Cleverer proposals, which use the data, can lead to improved performance and we outline this methodology here.

Instead of moving the particles $\{v^{(n)}_j\}_{n=1}^{N}$ according to the Markov kernel $P$, we use a Markov kernel $Q_j$ with density $Q(v_{j+1}|v_j, Y_{j+1})$. The weights $w^{(n)}_{j+1}$ are found, as before, by applying Bayes' formula for each particle, and then weighting appropriately as in (4.27):

$$\hat w^{(n)}_{j+1} = w^{(n)}_j\,\frac{\mathbb{P}\big(y_{j+1}|\hat v^{(n)}_{j+1}\big)\,\mathbb{P}\big(\hat v^{(n)}_{j+1}|v^{(n)}_j\big)}{Q\big(\hat v^{(n)}_{j+1}|v^{(n)}_j, Y_{j+1}\big)}, \tag{4.33a}$$

$$w^{(n)}_{j+1} = \frac{\hat w^{(n)}_{j+1}}{\sum_{n=1}^{N}\hat w^{(n)}_{j+1}}. \tag{4.33b}$$

The choice

$$Q\big(v_{j+1}|v^{(n)}_j, Y_{j+1}\big) = \mathbb{P}\big(v_{j+1}|v^{(n)}_j\big)$$

results in the bootstrap filter from the preceding subsection. In the more general case the approach results in the following algorithm:


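1. Set $j = 0$ and $\mu^N_0(v_0)\,dv_0 = \mathbb{P}(v_0)\,dv_0$.

2. Draw $v^{(n)}_j \sim \mu^N_j$, $n = 1,\dots,N$.

3. Set $w^{(n)}_j = 1/N$, $n = 1,\dots,N$.

4. Draw $\hat v^{(n)}_{j+1} \sim Q(\cdot\,|v^{(n)}_j, Y_{j+1})$.

5. Define $w^{(n)}_{j+1}$ by (4.33) and $\mu^N_{j+1} = \mathbb{P}^N(v_{j+1}|Y_{j+1})$ by (4.26).

6. $j + 1 \to j$.

7. Go to step 2.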


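We note that the normalization constants in (4.33), here assumed known in the definition of the reweighting, are of course not needed. The so-called optimal proposal is found by choosing

$$Q\big(v_{j+1}|v^{(n)}_j, Y_{j+1}\big) \equiv \mathbb{P}\big(v_{j+1}|v^{(n)}_j, y_{j+1}\big),$$

which results in

$$\hat w^{(n)}_{j+1} = w^{(n)}_j\,\mathbb{P}\big(y_{j+1}|v^{(n)}_j\big). \tag{4.34}$$

The above can be seen by observing that the definition of conditional probability gives

$$\begin{aligned}
\mathbb{P}\big(y_{j+1}|\hat v^{(n)}_{j+1}\big)\,\mathbb{P}\big(\hat v^{(n)}_{j+1}|v^{(n)}_j\big) &= \mathbb{P}\big(y_{j+1}, \hat v^{(n)}_{j+1}|v^{(n)}_j\big)\\
&= \mathbb{P}\big(\hat v^{(n)}_{j+1}|v^{(n)}_j, y_{j+1}\big)\,\mathbb{P}\big(y_{j+1}|v^{(n)}_j\big).
\end{aligned}$$

Substituting the optimal proposal into (4.33) then immediately gives (4.34).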
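
As an illustration of the optimal proposal, consider a special case in which everything is explicit. The sketch below assumes a scalar state with $v_{j+1} = \Psi(v_j) + \xi_j$, $\xi_j \sim N(0,\sigma^2)$, observed directly as $y_{j+1} = v_{j+1} + \eta_{j+1}$, $\eta_{j+1} \sim N(0,\gamma^2)$. Under these assumptions the conditional law of $v_{j+1}$ given $(v_j, y_{j+1})$ is Gaussian, by the usual product-of-Gaussians calculation, and $y_{j+1}$ given $v_j$ is $N(\Psi(v_j), \sigma^2+\gamma^2)$. The function name and signature are hypothetical.

```python
import math
import random


def optimal_proposal_step(particles, weights, y, psi, sigma, gamma, rng):
    """One optimal-proposal update for v' = psi(v) + N(0, sigma^2),
    y = v' + N(0, gamma^2). Returns proposed particles and normalized weights."""
    # v' | (v, y) is Gaussian with variance c and mean m(v).
    c = 1.0 / (1.0 / sigma**2 + 1.0 / gamma**2)
    s2 = sigma**2 + gamma**2  # variance of y given v
    proposed, w_hat = [], []
    for v, w in zip(particles, weights):
        m = c * (psi(v) / sigma**2 + y / gamma**2)
        proposed.append(rng.gauss(m, math.sqrt(c)))
        # (4.34): the unnormalized weight depends on v, not on the proposed point.
        w_hat.append(w * math.exp(-0.5 * (y - psi(v)) ** 2 / s2))
    total = sum(w_hat)
    return proposed, [w / total for w in w_hat]
```

Note that the weights can be computed before the new particles are drawn, since they do not depend on the proposed positions.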

This small difference from the bootstrap filter may seem trivial at a glance, and to come at the potentially large cost of sampling from $Q$. However, in the case of nonlinear Gaussian Markov models as we study here, the distribution and the weights are given in closed form. If the dynamics is highly nonlinear or the model noise is larger than the observational noise then the variance of the weights for the optimal proposal may be much smaller than for the standard proposal. The corresponding particle filter will be referred to with the acronym SIRS(OP) to indicate the optimal proposal. For deterministic dynamics the optimal proposal reduces to the standard proposal.

4.4 Large-Time Behaviour of Filters


With the exception of the Kalman filter for linear problems, and the particle filter in the general case, the filtering methods presented in this chapter do not, in general, give accurate approximations of the true posterior distribution; in particular the approximate Gaussian filters do not perform well as measured by the Bayesian quality assessment test of section 2.7. However they may perform well as measured by the signal estimation quality assessment test and the purpose of this section is to demonstrate this fact.

More generally, an important question concerning filters is their behaviour when iterated over long times and, in particular, their ability to recover the true signal underlying the data if iterated for long enough, even when initialized far from the truth. In this section we present some basic large time asymptotic results for filters to illustrate the key issue which affects the ability of filters to accurately recover the signal when iterated for long enough. The main idea is that the data must be sufficiently rich to stabilize any inherent instabilities within the underlying dynamical model (2.1); in rough terms it is necessary to observe only the unstable directions as the dynamics of the model itself will enable recovery of the true signal within the space spanned by the stable directions. We illustrate this idea first, in subsection 4.4.1, for the explicitly solvable case of the Kalman filter in one dimension, and then, in subsection 4.4.2, for the 3DVAR method.


4.4.1. The Kalman Filter in One Dimension 


We consider the case of one dimensional dynamics with

$$\Psi(v) = \lambda v, \qquad h(v) = v,$$

while we will also assume that

$$\Sigma = \sigma^2, \qquad \Gamma = \gamma^2.$$

With these definitions equations (4.3) become

$$\frac{1}{c_{j+1}} = \frac{1}{\sigma^2 + \lambda^2c_j} + \frac{1}{\gamma^2}, \tag{4.36a}$$

$$\frac{m_{j+1}}{c_{j+1}} = \frac{\lambda m_j}{\sigma^2 + \lambda^2c_j} + \frac{1}{\gamma^2}\,y_{j+1}, \tag{4.36b}$$

which, after some algebraic manipulations, give

$$c_{j+1} = g(c_j), \tag{4.37a}$$

$$m_{j+1} = \Big(1 - \frac{c_{j+1}}{\gamma^2}\Big)\lambda m_j + \frac{c_{j+1}}{\gamma^2}\,y_{j+1}, \tag{4.37b}$$

where we have defined

$$g(c) := \frac{\gamma^2(\lambda^2c + \sigma^2)}{\gamma^2 + \lambda^2c + \sigma^2}. \tag{4.38}$$

We wish to study the behaviour of the Kalman filter as $j \to \infty$, i.e. when more and more data points are assimilated into the model. Note that the covariance evolves independently of the data $\{y_j\}_{j\in\mathbb{Z}^+}$ and satisfies an autonomous nonlinear dynamical system. However it is of interest to note that, if $\sigma^2 = 0$, then the dynamical system for $c_j^{-1}$ is linear.

We now study the asymptotic properties of this map. The fixed points $c^\star$ of (4.37a) satisfy

$$c^\star = \frac{\gamma^2(\lambda^2c^\star + \sigma^2)}{\gamma^2 + \lambda^2c^\star + \sigma^2}, \tag{4.39}$$

and thus solve the quadratic equation

$$\lambda^2(c^\star)^2 + \big(\gamma^2(1 - \lambda^2) + \sigma^2\big)c^\star - \gamma^2\sigma^2 = 0.$$

We see that, provided $\lambda\gamma\sigma \ne 0$, one root is positive and one negative. The roots are given by

$$c^\star_\pm = \frac{-(\gamma^2 + \sigma^2 - \gamma^2\lambda^2) \pm \sqrt{(\gamma^2 + \sigma^2 - \gamma^2\lambda^2)^2 + 4\lambda^2\gamma^2\sigma^2}}{2\lambda^2}. \tag{4.40}$$

We observe that the update formula for the covariance ensures that, provided $c_0 \ge 0$, then $c_j \ge 0$ for all $j \in \mathbb{N}$. It also demonstrates that $c_j \le \gamma^2$ for all $j \in \mathbb{Z}^+$ so that the variance of the filter is no larger than the variance in the data. We may hence fix our attention on non-negative covariances, knowing that they are also uniformly bounded by $\gamma^2$. We will now study the stability of the non-negative fixed points.

We first start with the case $\sigma = 0$, which corresponds to deterministic dynamics, and for which the dynamics of $c_j^{-1}$ is linear. In this case we obtain

$$c^\star_+ = 0, \qquad c^\star_- = \frac{\gamma^2(\lambda^2 - 1)}{\lambda^2},$$

and

$$g'(c^\star_+) = \lambda^2, \qquad g'(c^\star_-) = \lambda^{-2},$$

which implies that when $\lambda^2 < 1$, $c^\star_+$ is an asymptotically stable fixed point, while when $\lambda^2 > 1$, $c^\star_-$ is an asymptotically stable fixed point. When $|\lambda| = 1$ the two roots are coincident at the origin and neutrally stable. Using the aforementioned linearity, for the case $\sigma = 0$ it is possible to solve (4.36a) to obtain for $\lambda^2 \ne 1$

$$\frac{1}{c_j} = \Big(\frac{1}{\lambda^2}\Big)^{j}\frac{1}{c_0} + \frac{1}{\gamma^2}\,\frac{\big(\frac{1}{\lambda^2}\big)^{j} - 1}{\frac{1}{\lambda^2} - 1}. \tag{4.41}$$

This explicit formula shows that the fixed point $c^\star_+$ (resp. $c^\star_-$) is globally asymptotically stable, and exponentially attracting on $\mathbb{R}^+$, when $\lambda^2 < 1$ (resp. $\lambda^2 > 1$). Notice also that $c^\star_- = O(\gamma^2)$ so that when $\lambda^2 > 1$, the asymptotic variance of the filter scales as the observational noise variance. Furthermore, when $\lambda^2 = 1$ we may solve (4.36a) to obtain

$$\frac{1}{c_j} = \frac{1}{c_0} + \frac{j}{\gamma^2},$$

showing that $c^\star_- = c^\star_+ = 0$ is globally asymptotically stable on $\mathbb{R}^+$, but is only algebraically attracting.
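
The closed form (4.41) can be checked against direct iteration of (4.36a) with $\sigma = 0$. The two helpers below are a small sanity check with hypothetical names, not part of the book's accompanying code.

```python
def precision_closed_form(c0, lam, gamma, j):
    """1/c_j from (4.41), valid when sigma = 0 and lam^2 != 1."""
    r = 1.0 / lam**2
    return r**j / c0 + (r**j - 1.0) / (r - 1.0) / gamma**2


def precision_by_iteration(c0, lam, gamma, j):
    """1/c_j obtained by iterating (4.36a) j times with sigma = 0."""
    c = c0
    for _ in range(j):
        c = 1.0 / (1.0 / (lam**2 * c) + 1.0 / gamma**2)
    return 1.0 / c
```

For $\lambda^2 > 1$ the closed form gives $1/c_j \to \lambda^2/(\gamma^2(\lambda^2 - 1))$, the reciprocal of the nonzero fixed point, while for $\lambda^2 < 1$ the precision grows geometrically, so that $c_j \to 0$.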

We now study the stability of the fixed points $c^\star_+$ and $c^\star_-$ in the case of $\sigma^2 > 0$, corresponding to the case where the dynamics are stochastic. To this end we prove some bounds on $g'(c^\star)$ that will also be useful when we study the behaviour of the error between the true signal and the estimated mean; here, and in what follows in the remainder of this example, prime denotes differentiation with respect to $c$. We start by noting that

$$g(c) = \gamma^2 - \frac{\gamma^4}{\gamma^2 + \lambda^2c + \sigma^2}, \tag{4.42}$$

and so

$$g'(c) = \frac{\lambda^2\gamma^4}{(\gamma^2 + \lambda^2c + \sigma^2)^2}.$$

Using the fact that $c^\star$ satisfies (4.39) together with equation (4.42), we obtain

$$g'(c^\star) = \frac{1}{\lambda^2}\,\frac{(c^\star)^2}{\big(c^\star + \frac{\sigma^2}{\lambda^2}\big)^2} \qquad\text{and}\qquad g'(c^\star) = \lambda^2\Big(1 - \frac{c^\star}{\gamma^2}\Big)^2.$$

We can now see that from the first equation we obtain the following two bounds, since $\sigma^2 > 0$:

$$g'(c^\star) < \lambda^{-2}, \ \text{for } \lambda \in \mathbb{R}, \qquad\text{and}\qquad g'(c^\star) < 1, \ \text{for } \lambda^2 = 1,$$

while from the second equality and the fact that, since $c^\star$ satisfies (4.39), $c^\star < \gamma^2$ we obtain

$$g'(c^\star) < \lambda^2$$

when $c^\star > 0$. We thus conclude that when $\sigma^2 > 0$ the fixed point $c^\star_+$ of (4.37a) is always stable independently of the value of the parameter $\lambda$.
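
The stability of the positive fixed point for $\sigma^2 > 0$ can be confirmed numerically by iterating the covariance map. The sketch below implements $g$ from (4.38), its derivative, and the positive root of (4.40); the function names are hypothetical.

```python
import math


def g(c, lam, sigma, gamma):
    """Covariance update map (4.38)."""
    return gamma**2 * (lam**2 * c + sigma**2) / (gamma**2 + lam**2 * c + sigma**2)


def g_prime(c, lam, sigma, gamma):
    """Derivative of g with respect to c."""
    return lam**2 * gamma**4 / (gamma**2 + lam**2 * c + sigma**2) ** 2


def positive_fixed_point(lam, sigma, gamma):
    """The root of (4.40) taken with the plus sign."""
    b = gamma**2 + sigma**2 - gamma**2 * lam**2
    disc = b**2 + 4.0 * lam**2 * gamma**2 * sigma**2
    return (-b + math.sqrt(disc)) / (2.0 * lam**2)


def iterate_covariance(c0, lam, sigma, gamma, n_steps):
    """Apply c_{j+1} = g(c_j) n_steps times, starting from c0."""
    c = c0
    for _ in range(n_steps):
        c = g(c, lam, sigma, gamma)
    return c
```

Starting the iteration far from the fixed point, for instance at $c_0 = 10$, and comparing the result with `positive_fixed_point` for $\lambda$ below, equal to, and above 1 illustrates the three rows of the right-hand column of Table 4.1.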



| | Limiting covariance for $\sigma^2 = 0$ | Limiting covariance for $\sigma^2 > 0$ |
|---|---|---|
| $|\lambda| < 1$ | $c_j \to 0$ (exponentially) | $c_j \to c^\star_+ = O(\gamma^2)$ (exponentially) |
| $|\lambda| = 1$ | $c_j \to 0$ (algebraically) | $c_j \to c^\star_+ = O(\gamma^2)$ (exponentially) |
| $|\lambda| > 1$ | $c_j \to c^\star_- = O(\gamma^2)$ (exponentially) | $c_j \to c^\star_+ = O(\gamma^2)$ (exponentially) |

Table 4.1: Summary of the limiting behaviour of covariance $c_j$ for Kalman filter applied to one dimensional dynamics

Figure 4.1: Cobweb diagram for equation (4.37a). Panels: (a) $\sigma = 0$, $\lambda = 0.8$, $\gamma = 1$; (b) $\sigma = 0$, $\lambda = 1$, $\gamma = 1$.

Table 4.1 summarises the behaviour of the variance of the Kalman filter in the case of one-dimensional dynamics. This is illustrated further in Figures 4.1 and 4.2 where we plot the cobweb diagram for the map (4.42). In particular, in Figure 4.1 we observe the difference between the algebraic and geometric convergence to 0, for different values of $\lambda$ in the case $\sigma = 0$, while in Figure 4.2 we observe the exponential convergence to $c^\star$ for the case of $|\lambda| > 1$. The analysis of the error between the mean and the truth underlying the data is left as an exercise at the end of the chapter. This shows that the error in the mean is, asymptotically, of order $\gamma^2$ in the case where $\sigma = 0$.

Figure 4.2: Cobweb diagram for equation (4.37a). Panels: (a) $\sigma = 0$, $\lambda = 1.2$, $\gamma = 1$; (b) $\sigma = 0.5$, $\lambda = 1.2$, $\gamma = 1$.


4.4.2. The 3DVAR Filter


In the previous subsection we showed that the Kalman filter accurately recovers any one-dimensional signal, provided the observational noise is small. The result allows for initialization far from the true signal and is, in this sense, quite strong. On the other hand, being only one-dimensional, it gives a somewhat limited picture. In this subsection we study the 3DVAR filter given by (4.14). We study conditions under which the 3DVAR filter will recover the true signal, to within a small observational noise level of accuracy, in dimensions bigger than one, and when only part of the system is observed.

To this end we assume that

$$y_{j+1} = Hv^\dagger_{j+1} + \epsilon_j \tag{4.43}$$

where the true signal satisfies

$$v^\dagger_{j+1} = \Psi(v^\dagger_j), \quad j \in \mathbb{N}, \tag{4.44a}$$

$$v^\dagger_0 = u, \tag{4.44b}$$

and, for simplicity, we assume that the observational noise satisfies

$$\sup_{j\in\mathbb{N}}|\epsilon_j| = \epsilon. \tag{4.45}$$

We have the following result.

Theorem 4.10 Assume that the data is given by (4.43), where the signal follows equation (4.44) and the error in the data satisfies (4.45). Assume furthermore that $C$ is chosen so that $(I - KH)\Psi : \mathbb{R}^n \to \mathbb{R}^n$ is globally Lipschitz with constant $a < 1$ in some norm $\|\cdot\|$. Then there is a constant $c > 0$ such that

$$\limsup_{j\to\infty}\|m_j - v^\dagger_j\| \le \frac{c}{1-a}\,\epsilon.$$

Proof We may write (4.14), (4.44), using (4.43), as

$$\begin{aligned}
m_{j+1} &= (I - KH)\Psi(m_j) + KH\Psi(v^\dagger_j) + K\epsilon_j,\\
v^\dagger_{j+1} &= (I - KH)\Psi(v^\dagger_j) + KH\Psi(v^\dagger_j).
\end{aligned}$$

Subtracting, and letting $e_j = m_j - v^\dagger_j$ gives, for some finite constant $c$ independent of $j$,

$$\begin{aligned}
\|e_{j+1}\| &\le \|(I - KH)\Psi(m_j) - (I - KH)\Psi(v^\dagger_j)\| + \|K\epsilon_j\|\\
&\le a\|e_j\| + c\epsilon.
\end{aligned}$$

Applying the Gronwall Lemma 1.14 gives the desired result. □

We refer to a map with Lipschitz constant less than 1 as a contraction in what follows. 


Remark 4.11 The preceding simple theorem shows that it is possible to construct filters which can recover from being initialized far from the truth and lock on to a small neighbourhood of the true signal underlying the data, when run for long enough. Furthermore, this can happen even when the system is only partially observed, provided that the observational noise is small and enough of the system is observed. This concept of observing "enough" illustrates a key idea in filtering: the question of whether the fixed model covariance in 3DVAR, $C$, can be chosen to make $(I - KH)\Psi$ into a contraction involves a subtle interplay between the underlying dynamics, encapsulated in $\Psi$, and the observation operator $H$. In rough terms the question of making $(I - KH)\Psi$ into a contraction is the question of whether the unstable parts of the dynamics are observed; if they are then it is typically the case that $C$ can be designed to obtain the desired contraction. ♠


Example 4.12 Assume that $H = I$, so that the whole system is observed, that $\Gamma = \gamma^2I$ and $C = \sigma^2I$. Then, for $\eta^2 = \gamma^2/\sigma^2$,

$$S = (\sigma^2 + \gamma^2)I, \qquad K = \frac{\sigma^2}{\sigma^2 + \gamma^2}\,I$$

and

$$(I - KH) = \frac{\gamma^2}{\sigma^2 + \gamma^2}\,I = \frac{\eta^2}{1 + \eta^2}\,I.$$

Thus, if $\Psi : \mathbb{R}^n \to \mathbb{R}^n$ is globally Lipschitz with constant $\lambda > 0$ in the Euclidean norm, $|\cdot|$, then $(I - KH)\Psi$ is globally Lipschitz with constant $a < 1$, if $\eta$ is chosen so that $\frac{\eta^2\lambda}{1+\eta^2} < 1$. Thus, by choosing $\eta$ sufficiently small the filter can be made to contract. This corresponds to trusting the data sufficiently in comparison to the model. It is a form of variance inflation in that, for given level of observational noise, $\eta$ can be made sufficiently small by choosing the model variance scale $\sigma^2$ sufficiently large - "inflating" the model variance. ♠
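
The contraction argument of Theorem 4.10 can be watched in action in the setting of this example. The sketch below is illustrative only and rests on several assumptions: the state is scalar with $H = 1$, the map is taken to be $\Psi(v) = 2.5\sin(v)$ (globally Lipschitz with constant 2.5), and the observational noise is drawn uniformly from $[-\epsilon, \epsilon]$ so that it is bounded as in (4.45). The function names are hypothetical.

```python
import math
import random


def threedvar_errors(eta, eps, m0, v0, n_steps, rng):
    """Scalar 3DVAR with H = 1: m_{j+1} = (1 - K) psi(m_j) + K y_{j+1},
    with gain K = 1 / (1 + eta^2). Returns |m_j - v_j| at each step."""
    psi = lambda v: 2.5 * math.sin(v)  # assumed test dynamics
    gain = 1.0 / (1.0 + eta**2)
    m, v, errors = m0, v0, []
    for _ in range(n_steps):
        v = psi(v)                          # truth, as in (4.44)
        y = v + rng.uniform(-eps, eps)      # data with bounded noise
        m = (1.0 - gain) * psi(m) + gain * y
        errors.append(abs(m - v))
    return errors


def asymptotic_error_bound(eta, eps, lipschitz=2.5):
    """c * eps / (1 - a), with a = eta^2 * lipschitz / (1 + eta^2) and c = K."""
    a = eta**2 * lipschitz / (1.0 + eta**2)
    if a >= 1.0:
        raise ValueError("eta is too large for (I - KH) psi to contract")
    return eps / (1.0 + eta**2) / (1.0 - a)
```

With $\eta = 0.5$ the contraction constant is $a = 0.5$, so an initial error of any size is forgotten geometrically and the late-time errors fall below the value returned by `asymptotic_error_bound`.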


Example 4.13 Assume that there is a partition of the state space in which $H = (I, 0)$, so that only part of the system is observed. Set $\Gamma = \gamma^2I$ and $C = \sigma^2I$. Then, with $\eta$ as in the previous example,

$$I - KH = \begin{pmatrix}\frac{\eta^2}{1+\eta^2}\,I & 0\\ 0 & I\end{pmatrix}.$$

Whilst the previous example shows that variance inflation may help to stabilize the filter, this example shows that, in general, more is required: in this case it is clear that making $(I - KH)\Psi(\cdot)$ into a contraction will require a relationship between the subspace in which we observe and the space in which the dynamics of the map is expanding and contracting. For example, if $\Psi(u) = Lu$ and

$$L = \begin{pmatrix}2I & 0\\ 0 & aI\end{pmatrix}$$

then

$$(I - KH)L = \begin{pmatrix}\frac{2\eta^2}{1+\eta^2}\,I & 0\\ 0 & aI\end{pmatrix}.$$

When $|a| < 1$ this can be made into a contraction by choosing $\eta$ sufficiently small; but for $|a| \ge 1$ this is no longer possible. The example thus illustrates the intuitive idea that the observations should be sufficiently rich to ensure that the unstable directions within the dynamics can be tamed by observing them.


4.4.3. The Synchronization Filter 

A fundamental idea underlying successful filtering of partially observed dynamical systems is synchronization. To illustrate this we introduce and study the idealized synchronization filter. To this end consider a partition of the identity $P + Q = I$. We write $v = (p, q)$ where $p = Pv$, $q = Qv$ and then, with a slight abuse of notation, write $\Psi(v) = \Psi(p, q)$. Consider a true signal governed by the deterministic dynamics model (4.44) and write $v^\dagger_k = (p^\dagger_k, q^\dagger_k)$, with $p^\dagger_k = Pv^\dagger_k$ and $q^\dagger_k = Qv^\dagger_k$. Then

$$\begin{aligned}
p^\dagger_{k+1} &= P\Psi(p^\dagger_k, q^\dagger_k),\\
q^\dagger_{k+1} &= Q\Psi(p^\dagger_k, q^\dagger_k).
\end{aligned}$$

Now imagine that we observe $y_k = p^\dagger_k$ exactly, without noise. Then the synchronization filter simply fixes the image under $P$ to $p^\dagger_k$, and plugs this into the image of the dynamical model under $Q$; if the filter is $m_k = (p_k, q_k)$ with $p_k = Pm_k$ and $q_k = Qm_k$ then

$$\begin{aligned}
p_{k+1} &= p^\dagger_{k+1},\\
q_{k+1} &= Q\Psi(p^\dagger_k, q_k).
\end{aligned}$$

We note that, expressed in terms of the data, this filter has the form

$$m_{k+1} = Q\Psi(m_k) + Py_{k+1}. \tag{4.46}$$

A key question now is whether or not the filter synchronizes in the following sense:

$$|q_k - q^\dagger_k| \to 0 \quad\text{as } k \to \infty.$$

This of course is equivalent to

$$|m_k - v^\dagger_k| \to 0 \quad\text{as } k \to \infty. \tag{4.47}$$

Whether or not this happens involves, as for 3DVAR described above, a subtle interplay between the underlying dynamics and the observation operator, here $P$. The bibliography section 4.6 contains pointers to the literature studying this question.
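
A toy example makes the synchronization mechanism concrete. The map below is hypothetical and chosen only for illustration: the observed component $p$ follows the chaotic logistic map, while the unobserved component $q$ is driven by $p$ through a contraction, so one expects the filter (4.46) to synchronize. The function names are also hypothetical.

```python
def psi(p, q):
    """Assumed test map: chaotic logistic dynamics in p, with q slaved
    to p through a contraction of rate 0.5."""
    return 4.0 * p * (1.0 - p), 0.5 * q + p


def synchronization_errors(truth0, filter0, n_steps):
    """Run the truth and the filter m_{k+1} = Q psi(m_k) + P y_{k+1} of (4.46)
    with noise-free data y_k = p_k; returns |m_k - v_k| in the max norm."""
    (p_true, q_true), (p_filt, q_filt) = truth0, filter0
    errors = []
    for _ in range(n_steps):
        p_true_next, q_true_next = psi(p_true, q_true)
        _, q_filt_next = psi(p_filt, q_filt)  # Q psi(m_k)
        p_filt_next = p_true_next             # P y_{k+1}
        p_true, q_true = p_true_next, q_true_next
        p_filt, q_filt = p_filt_next, q_filt_next
        errors.append(max(abs(p_filt - p_true), abs(q_filt - q_true)))
    return errors
```

After the first step the observed components agree exactly, and the error in $q$ then halves at every step. Replacing the factor 0.5 in `psi` by a number of modulus at least 1 removes the contraction in the unobserved direction, and the same code then shows no synchronization.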

In fact the following example shows how the synchronization filter can be viewed as a distinguished parameter limit, corresponding to infinite variance inflation, for a particular family of 3DVAR filters.

Example 4.14 Let $H = P$ and $\Gamma = \gamma^2I$. If we choose $C$ as in Example 4.13 then the 3DVAR filter can be written as

$$m_{k+1} = S\Psi(m_k) + (I - S)y_{k+1}, \tag{4.48a}$$

$$S = \frac{\eta^2}{1+\eta^2}\,P + Q. \tag{4.48b}$$

The limit $\eta \to 0$ is the extreme limit of variance inflation referred to in Example 4.12. In this limit the 3DVAR filter becomes the synchronization filter (4.46).

4.5 Illustrations 



Figure 4.3: Kalman filter applied to the linear system of Example 2.2 with $A = A_3$, $H = (1,0)$, $\Sigma = I$, and $\Gamma = 1$, see also p8.m in Section 5.2.5. The problem is initialized with mean 0 and covariance $10I$. Panels: (a) Solution; (b) Covariance.

The first illustration concerns the Kalman filter applied to the linear system of Example 2.2 with $A = A_3$. We assume that $H = (1, 0)$ so that we observe only the first component of the system and the model and observational covariances are $\Sigma = I$ and $\Gamma = 1$, where $I$ is the $2 \times 2$ identity. The problem is initialized with mean 0 and covariance $10I$. Figure 4.3(a) shows the behaviour of the filter on the unobserved component, showing how the mean locks onto a small neighbourhood of the truth and how the one-standard deviation confidence intervals computed from the variance on the second component also shrink from a large initial value to an asymptotic small value; this value is determined by the observational noise variance in the first component. In Figure 4.3(b) the trace of the covariance matrix is plotted demonstrating that the total covariance matrix asymptotes to a small limiting matrix. And finally Figure 4.3(c) shows the error (in the Euclidean norm) between the filter mean and the truth underlying the data, together with its running average. We will employ similar figures (a), (b) and (c) in the examples which follow in this section.

Figure 4.4: 3DVAR methodology applied to the logistic map Example 2.4 with $r = 4$, $\gamma^2 = 10^{-2}$, and $c = \gamma^2/\eta$ with $\eta = 0.2$, see also p9.m in section 5.3.1.

The next illustration shows the 3DVAR algorithm applied to Example 2.4 with $r = 2.5$. We consider noise-free dynamics and observational variance of $\gamma^2 = 10^{-2}$. The fixed model covariance is chosen to be $c = \gamma^2/\eta$ with $\eta = 0.2$. The resulting algorithm performs well at tracking the truth with asymptotic time-averaged Euclidean error of size roughly $10^{-2}$. See Figure 4.4.

The rest of the figures illustrate the behaviour of the various filters, all applied to Example 2.3 with $\alpha = 2.5$, $\sigma = 0.3$, and $\gamma = 1$. In particular, 3DVAR (Figure 4.5), ExKF (Figure 4.6), EnKF (Figure 4.7), ETKF (Figure 4.8), and the particle filter with standard (Figure 4.9) and optimal (Figure 4.10) proposals are all compared on the same example. The ensemble-based methods all use 100 ensemble members each (notice this is much larger than the dimension of the state space, which is $n = 1$ here, and so is outside the regime in which the ensemble methods would usually be employed in practice). For 3DVAR, results from which (for this example) are only shown in the summary Figure 4.11, we take $\eta = 0.5$.

Figure 4.5: 3DVAR for the sin map Example 2.3 with $\alpha = 2.5$, $\sigma = 0.3$, $\gamma = 1$, and $\eta = 0.2$, see also p10.m in Section 5.3.3.

All of the methods perform well at tracking the true signal, asymptotically in time, recovering from a large initial error. However they also all exhibit occasional instabilities, and lose track of the true signal for short periods of time. From Fig. 4.6(c) we can observe that the ExKF has small error for most of the simulation, but that sporadic large excursions are seen in the error. From Fig. 4.8(c) one can observe that ETKF is similarly prone to small destabilization and local instability as the EnKF with perturbed observations in Fig. 4.7(c). Also, notice from Figure 4.9(c) that the particle filter with standard proposal is perhaps slightly more prone to destabilization than the optimal proposal in Figure 4.10(c), although the difference is minimal.

The performance of the filters is now compared through a detailed study of the statistical properties of the error $e = m - v^\dagger$, over long simulation times. In particular we compare the histograms of the errors, and their large time averages. Figure 4.11 compares the errors incurred by the three basic methods 3DVAR, ExKF, and EnKF, demonstrating that the EnKF is the most accurate method of the three on average, with ExKF the least accurate on average. Notice from Fig. 4.11(a) that the error distribution of 3DVAR is the widest, and both it and EnKF remain consistently accurate. The distribution of ExKF is similar to EnKF, except with "fat tails" associated to the destabilization intervals seen in Fig. 4.6.

Figure 4.12 compares the errors incurred by the four more accurate ensemble-based methods EnKF, ETKF, SIRS, and SIRS(OP). The error distribution, Fig. 4.12(a), of all these filters is similar. In Fig. 4.12(b) one can see that the time-averaged error is indistinguishable between EnKF and ETKF. Also, the EnKF, ETKF, and SIRS(OP) remain more or less consistently accurate. The distribution of $e$ for SIRS is similar to SIRS(OP), except with fat tails associated to the destabilization intervals seen in Fig. 4.9, which leads to the larger time-averaged error seen in Fig. 4.12(b). In this sense, the distribution of $e$ is similar to that for ExKF.

Figure 4.6: ExKF on the sin map Example 2.3 with $\alpha = 2.5$, $\sigma = 0.3$, and $\gamma = 1$, see also p11.m in Section 5.3.4. Panels include (b) Covariance; (c) Error.


4.6 Bibliographic Notes 

• Section 4.1 The Kalman Filter has found wide-ranging application to low dimensional engineering applications where the linear Gaussian model is appropriate, since its introduction in 1960 [65]. In addition to the original motivation in control of flight vehicles, it has grown in importance in the fields of econometric time-series analysis, and signal processing. It is also important because it plays a key role in the development of the approximate Gaussian filters which are the subject of section 4.2. The idea behind the Kalman filter, to optimally combine model and data, is arguably one of the most important ideas in applied mathematics over the last century: the impact of the paper [65] on many applications domains has been huge.

Figure 4.7: EnKF on the sin map Example 2.3 with $\alpha = 2.5$, $\sigma = 0.3$, $\gamma = 1$ and $N = 100$, see also p12.m in Section 5.3.5.

• Section 4.2 All the non-Gaussian filters we discuss are based on modifying the Kalman filter so that it may be applied to non-linear problems. The development of new filters is a very active area of research and the reader is directed to the book [82], together with the articles [55] and [114] for insight into some of the recent developments with an applied mathematics perspective.

The 3DVAR algorithm was proposed at the UK Met Office in 1986 [75, 76], and was subsequently developed by the US National Oceanic and Atmospheric Administration [95] and by the European Centre for Medium-Range Weather Forecasts (ECMWF) in [32]. The perspective of these papers was one of minimization and, as such, easily incorporates nonlinear observation operators via the objective functional (4.11), with a fixed $C = \hat C_{j+1}$, for the analysis step of filtering; nonlinear observation operators are important in numerous applications, including numerical weather forecasting. In the case of linear observation operators the objective functional is given by (4.12) with explicit solution given, in the case $C = \hat C_{j+1}$, by (4.14). In fact the method of optimal interpolation predates 3DVAR and takes the linear equations (4.14) as the starting point, rather than starting from a minimization principle; it is then very closely related to the method of kriging from the geosciences [108]. The 3DVAR algorithm is important because it is prototypical of the many more sophisticated filters which are now widely used in practice and it is thus natural to study it.

Figure 4.8: ETKF on the sin map Example 2.3 with $\alpha = 2.5$, $\sigma = 0.3$, $\gamma = 1$ and $N = 100$, see also p13.m in Section 5.3.6.

The extended Kalman filter was developed in the control theory community and is discussed at length in [63]. It is not practical to implement in high dimensions, and low-rank extended Kalman filters are then used instead.

The ensemble Kalman filter uses a set of particles to estimate covariance information, 
and may be viewed as an approximation of the extended Kalman filter, designed to 
be suitable in high dimensions. See [40] for an overview of the methodology, written 
by one of its originators, and [116] for an early example of the power of the method. 
We note that the minimization principle (4.15) has the very desirable property that 
the resulting samples correspond to samples of the Gaussian distribution found by 
Bayes theorem with a Gaussian prior having covariance \hat{C}_{j+1} and likelihood y_{j+1}|v. 
This is the idea behind the randomized maximum likelihood method described in [93], 
and widely used in petroleum applications; the idea is discussed in detail in the context 
of the EnKF in [57]. There has been some analysis of the EnKF in the large sample 
limit; see for example [73], [7], [55]. However, the primary power of the method for 
practitioners is that it seems to provide useful information for small sample sizes; it is 
therefore perhaps a more interesting direction for analysis to study the behaviour of the 
algorithm, and determine methodologies to improve it, for fixed numbers of ensemble 
members. There is some initial work in this direction and we describe it below. 

Figure 4.9: Particle Filter (standard proposal) on the sin map Example 2.3 with α = 2.5, 
σ = 0.3, γ = 1 and N = 100; see also p14.m in Section 5.3.7. Surviving panel titles: 
(b) Covariance; (c) Error, plotted against iteration j. 

Note that the Γ appearing in the perturbed observation EnKF can be replaced by the 
sample covariance \tilde{Γ} of the perturbations added to the observations, and this is often 
done in practice. The sample covariance of the updated ensemble in this case is equal 
to (I − \tilde{K}_{j+1}H)\hat{C}_{j+1}, where \tilde{K}_{j+1} is the gain corresponding to the sample 
covariance \tilde{Γ}. 

There are a range of parameters which can be used to tune the approximate Gaussian 
filters or modifications of those filters. In practical implementations, especially for high 
dimensional problems, the basic forms of the ExKF and EnKF as described here are 
prone to poor behaviour and such tuning is essential [66|I10]. In Examples 4.12 and 4.13 
we have already shown the role of variance inflation for 3DVAR and this type of approach 
is also fruitfully used within ExKF and EnKF. A basic version of variance inflation is 
to replace the estimate \hat{C}_{j+1} in (4.13) by ε\hat{C} + \hat{C}_{j+1} where \hat{C} is a fixed covariance 
such as that used in a 3DVAR method. Introducing ε ∈ (0,1) leads, for positive-definite \hat{C}, 
to an operator without a null-space and consequently to better behaviour. In contrast 
taking ε = 0 can lead to singular model covariances. This observation is particularly 
important when the EnKF is used in high dimensional systems where the number of 
ensemble members, N, is always less than the dimension n of the state space. In this 
situation \hat{C}_{j+1} necessarily has a null-space of dimension at least n − N. It can also 
be important for the ExKF where the evolving dynamics can lead, asymptotically in j, 
to degenerate \hat{C}_{j+1} with non-trivial null-space. Notice also that this form of variance 
inflation can be thought of as using 3DVAR-like covariance updates, in the directions 
not described by the ensemble covariance. This can be beneficial in terms of the ideas 
underlying Theorem 4.10, where the key idea is that K close to the identity can help 
ameliorate growth in the underlying dynamics. This may also be achieved by replacing 
the estimate \hat{C}_{j+1} in (4.13) by (1 + ε)\hat{C}_{j+1}. This is another commonly used inflation 
tactic; note, however, that it lacks the benefit of rank correction. It may therefore be 
combined with the additive inflation yielding ε_1\hat{C} + (1 + ε_2)\hat{C}_{j+1}. More details regarding 
tuning of filters through inflation can be found in [U |42l [63l [Ml EQ]. 

Figure 4.10: Particle Filter (optimal proposal) on the sin map Example 2.3 with α = 2.5, 
σ = 0.3, γ = 1 and N = 100; see also p15.m in Section 5.3.8. 
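The rank argument for additive versus multiplicative inflation can be checked numerically. The following is a minimal sketch, not taken from the programs of Chapter 5: the dimensions n and N, the value of epsilon, the random ensemble V and the choice of the identity for the fixed covariance Chat are all illustrative assumptions.

```matlab
% Sketch only: n, N, epsilon and Chat = I are illustrative assumptions.
n = 10; N = 4;                    % state dimension and ensemble size (N < n)
rng(1);                           % fix the seed for reproducibility
V = randn(n, N);                  % hypothetical ensemble, one member per column
Csample = cov(V');                % n-by-n sample covariance, rank at most N-1
Chat = eye(n);                    % fixed positive-definite covariance, as in 3DVAR
epsilon = 0.1;
Cadd  = epsilon*Chat + Csample;   % additive inflation
Cmult = (1 + epsilon)*Csample;    % multiplicative inflation
rankSample = rank(Csample);
rankAdd    = rank(Cadd);
rankMult   = rank(Cmult);
fprintf('rank: sample %d, additive %d, multiplicative %d\n', ...
        rankSample, rankAdd, rankMult);
```

Under these assumptions the additive form has full rank n, whereas the multiplicative form inherits the null-space of the sample covariance, which is the rank correction referred to in the text.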
Figure 4.11: Convergence of e = m − v† for each filter for the sin map Example 2.3, 
corresponding to solutions from Figs. 4.5, 4.6 and 4.7. Panels: (a) log-scale histograms; 
(b) running average root mean square e. 


Another methodology which is important for practical implementation of the EnKF is 
localization [^|40]. This is used to reduce unwanted correlations in Cj between points 
which are separated by large distances in space. The underlying assumption is that the 
correlation between points decays proportionally to their distance from one another, 
and as such is increasingly corrupted by the sample error in ensemble methods. The 
sample covariance is hence modified to remove correlations between points separated 
by large distances in space. This is typically achieved by composing the empirical 
correlation matrix with a convolution kernel. Localization can have the further benefit 
of increasing rank, as for the first type of variance inflation described above. An early 
reference illustrating the benefits and possible implementation of localization is [58] . An 
important reference which links this concept firmly with ideas from dynamical systems 
is [94]. 
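One common way to realise localization, assumed here rather than taken from the text, is an elementwise (Schur) product of the sample covariance with a distance-based taper. In the sketch below the one dimensional grid, the Gaussian taper and the length scale L are hypothetical choices made only for illustration.

```matlab
% Sketch only: grid, taper shape and length scale L are illustrative assumptions.
n = 40; N = 10;                   % number of grid points and ensemble size
rng(1);
V = randn(n, N);                  % hypothetical ensemble, columns are members
Csample = cov(V');                % noisy sample covariance, rank at most N-1
[I, J] = meshgrid(1:n, 1:n);
dist = abs(I - J);                % distance between grid points i and j
L = 1;                            % localization length scale (assumed)
rho = exp(-dist.^2/(2*L^2));      % Gaussian taper, equal to 1 on the diagonal
Cloc = rho .* Csample;            % elementwise (Schur) product
fprintf('rank before %d, after %d\n', rank(Csample), rank(Cloc));
```

The taper leaves the variances on the diagonal unchanged, suppresses the entries linking distant points, and in this example also raises the rank of the covariance estimate.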

Following the great success of the ensemble Kalman filter algorithm, in a series of papers 
|110L[T51 151 1118] the square-root filter framework was (re)discovered. The idea goes back 
to at least |S|. We focused the discussion above in section 4.2.4 on the ETKF, but we 
note that it is possible to derive different transformations. For example, the singular 
evolutive interpolated Kalman (SEIK) filter proceeds by first projecting the ensemble 
into the (K−1)-dimensional mean-free subspace, and then identifying a (K−1) × (K−1) 
matrix transformation, effectively prescribing a K × (K−1) matrix transformation L_j 
as opposed to the K × K rank (K−1) matrix T_j proposed in ETKF. The former 
is unique up to unitary transformation, while the latter is unique only up to unitary 
transformations which have 1 as eigenvector. Other alternative transformations may 
take the forms A_j or K_j such that X_j = A_j \hat{X}_j or X_j = (I − K_j H)\hat{X}_j. These are known 
as the ensemble adjustment Kalman filter (EAKF) and the ensemble square-root filter 
(ESRF) respectively. See [18] for details about the ETKF, [3] for details about the 
EAKF and [118] for details about the ESRF. A review of all three is given in 
m- The similar singular evolutive interpolated Kalman (SEIK) filter was introduced 
in [^ and is compared with the other square root filters in [52]. Other ensemble-based 
filters have been developed in recent years bridging the ensemble Kalman filter with the 
particle filter, for example [57l [56l [Ml I102j. 

Figure 4.12: Convergence of e = m − v† for both versions of EnKF in comparison to the 
particle filters for the sin map Ex. 1.3, corresponding to solutions from Figs. 4.7, 4.8, 4.9 
and 4.10. 

• Section 4.3. In the linear case, the extended Kalman filter of course coincides with the 
Kalman filter; furthermore, in this case the perturbed observation ensemble Kalman 
filter reproduces the true posterior distribution in the large particle limit [40]. However 
the filters introduced in section 4.2 do not produce the correct posterior distribution 
when applied to general nonlinear problems. In contrast, the particle filter does 
recover the true posterior distribution as the number of particles tends to infinity, as 
we show in Theorem 4.5. This proof is adapted from the very clear exposition in m- 

For more refined analyses of the convergence of particle filters see, for example, [33], 
[37] and references therein. As explained in Remarks 4.6, the constant appearing in 
the convergence results may depend exponentially on time if the mixing properties of 
the transition kernel P(dv_j|v_{j-1}) are poor (the undesirable properties of deterministic 
dynamics illustrate this). There is also interesting work studying the effect of the 
dimension [103]. A proof which exploits ergodicity of the transition kernel, when that is 
present, may be found in [37]; the assumptions there on the transition and observation 
kernels are very strong, and are generally not satisfied in practice, but studies indicate 
that comparable results may hold under less stringent conditions. 

For a derivation and discussion of the optimal proposal, introduced in section 4.3.3, 
see [38] and references therein. We also mention here the implicit filters developed by 
Chorin and co-workers [27], [26], [25]. These involve solving an implicit nonlinear equation 
for each particle which includes knowledge of the next set of observed data. This has 
some similarities to the method proposed in [115] and both are related to the optimal 
proposal mentioned above. 

• Section 4.4. The stability of the Kalman filter is a well-studied subject and the book 
[68] provides an excellent overview from the perspective of linear algebra. For extensions 
to the extended Kalman filter see [63]. Theorem 4.10 provides a glimpse into 
the mechanisms at play within 3DVAR, and approximate Gaussian filters in general, in 
determining stability and accuracy: the incorporation of data can convert unstable 
dynamical systems, with positive Lyapunov exponents, into contractive non-autonomous 
dynamical systems, thereby leading, in the case of small observational noise, to filters 
which recover the true signal within a small error. This idea was highlighted in [24] 
and first studied rigorously for the 3DVAR method applied to the Navier-Stokes equation 
in HD; this work was subsequently generalized to a variety of different models in 
[88l iTOl [69]. It is also of note that these analyses of 3DVAR build heavily on ideas 
developed in [54] for a specialized form of data assimilation in which the observations are 
noise-free. In the language of the synchronization filter introduced in subsection 4.4.3, 
this paper demonstrates the synchronization property (4.47) for the Navier-Stokes equation 
with a sufficiently large number of Fourier mode observations, and for the Lorenz '63 
model of Example 2.6 observed in only the first component. The paper [69] considers 
similar issues for the Lorenz '96 model of Example 2.7. Similar ideas are studied for the 
perturbed observation EnKF in mi; in this case it is necessary to introduce a form of 
variance inflation to get a result analogous to Theorem 4.10. An important step in the 
theoretical analysis of ensemble Kalman filter methods is the paper [48] which uses ideas 
from the theory of shadowing in dynamical systems; the work proves that the ETKF 
variant can shadow the truth on arbitrarily long time intervals, provided the dimension 
of the ensemble is greater than the number of unstable directions in the system. 

In the context of filter stability it is important to understand the optimality of the mean 
of the true filtering distribution. We observe that all of the filtering algorithms that we 
have described produce an estimate of the probability distribution P(v_j|Y_j) that depends 
only on the data Y_j. There is a precise sense in which the true filtering distribution can 
be used to find a lower bound on the accuracy that can be achieved by any of these 
approximate algorithms. We let E(v_j|Y_j) denote the mean of v_j under the probability 
distribution P(v_j|Y_j) and let E_j(Y_j) denote any estimate of the state v_j based only 
on data Y_j. Now consider all possible random data sets Y_j generated by the model 
(2.1), (2.2), noting that the randomness is generated by the initial condition v_0 and the 
noises {ξ_j, η_j}; in particular, conditioning on Y_j to obtain the probability distribution 
P(v_j|Y_j) can be thought of as being induced by conditioning on the observational noise 
{η_k}_{k=1,...,j}. Then E_j^*(Y_j) := E(v_j|Y_j) minimizes the mean-square error with respect to 
the random model (2.1), (2.2): 

\mathbb{E}\|v_j - E_j^*(Y_j)\|^2 \le \mathbb{E}\|v_j - E_j(Y_j)\|^2 (4.49) 

for all E_j(Y_j). Thus the algorithms we have described can do no better at estimating the 
state of the system than can be achieved, in principle, from the conditional mean of the 
state given the data E(v_j|Y_j). This lower bound holds on average over all instances of 
the model. An alternative way to view the inequality (4.49) is as a means to providing 
upper bounds on the true filter. For example, under the conditions of Theorem 4.10 
the righthand side of (4.49) is, asymptotically as j → ∞, of size O(ε^2); thus we deduce 
that 

\limsup_{j\to\infty} \mathbb{E}\|v_j - \mathbb{E}(v_j|Y_j)\|^2 \le C\epsilon^2. 

This viewpoint is adopted in [100] where the 3DVAR filter is used to bound the true 
filtering distribution. This latter optimality property can be viewed as resulting from 
the Galerkin orthogonality interpretation of the error resulting from taking conditional 
expectation. 
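The orthogonality behind the optimality of the conditional mean can be written out in a few lines. The following is a sketch only, assuming finite second moments and writing E_j^*(Y_j) for the conditional mean of v_j given Y_j and E_j(Y_j) for any other estimate depending only on Y_j; the cross term vanishes because E_j^*(Y_j) − E_j(Y_j) is a function of Y_j alone, so that conditioning on Y_j annihilates v_j − E_j^*(Y_j).

```latex
\begin{align*}
\mathbb{E}\|v_j - E_j(Y_j)\|^2
 &= \mathbb{E}\|v_j - E_j^*(Y_j)\|^2 + \mathbb{E}\|E_j^*(Y_j) - E_j(Y_j)\|^2 \\
 &\qquad + 2\,\mathbb{E}\bigl\langle v_j - E_j^*(Y_j),\; E_j^*(Y_j) - E_j(Y_j)\bigr\rangle \\
 &= \mathbb{E}\|v_j - E_j^*(Y_j)\|^2 + \mathbb{E}\|E_j^*(Y_j) - E_j(Y_j)\|^2 \\
 &\ge \mathbb{E}\|v_j - E_j^*(Y_j)\|^2 .
\end{align*}
```

Equality holds only when the estimate E_j(Y_j) agrees with the conditional mean almost surely.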

We have considered large time-behaviour on the assumption that the map Ψ can be 
implemented exactly. In situations where the underlying map Ψ arises from a differential 
equation and numerical methods are required, large excursions in phase space caused 
by observational noise can cause numerical instabilities in the integration methods 
underlying the filters. Remark ?? illustrates this fact in the context of continuous time. 
See [49] for a discussion of this issue. 

• Section 4.5. We mention here the rank histogram. This is another consistency check on 
the output of ensemble or particle based approximations of the filtering distribution. 
The idea, for a scalar observed quantity, is to generate ordered bins associated with 
that scalar and then to keep track of the statistics over time of 
the data y_j with respect to the bins. For example, if one has an approximation of 
the distribution consisting of N equally-weighted particles, then a rank histogram for 
the first component of the state consists of three steps, each carried out at each time j. 
First, add a random draw from the observational noise N(0, Γ) to each particle after the 
prediction phase of the algorithm. Secondly, order the particles according to the value of 
their first component, generating N − 1 bins between the values of the first component of 
each particle, and with one extra bin on each end. Finally, rank the current observation 
y_j between 1 and N + 1 depending on which bin it lands in. Proceeding to do this 
at each time j, a histogram of the rank of the observations is obtained. The "spread" 
of the ensemble can be evaluated using this diagnostic. If the histogram is uniform, 
then the spread is consistent. If it is concentrated at the center, then the spread is 
overestimated. If it is concentrated at the edges, then the spread is underestimated. 
This consistency check on the statistical model used was introduced in [2] and is widely 
adopted throughout the data assimilation community. 
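The three steps of the rank histogram can be sketched as follows. This is an illustration under stated assumptions rather than one of the programs of Chapter 5: the state is scalar, and both the truth and the forecast particles are drawn from the same standard normal distribution, so that the ensemble is statistically consistent by construction. The values of J, N and gamma are arbitrary.

```matlab
% Sketch only: a synthetic, statistically consistent ensemble; values are illustrative.
J = 5000; N = 9; gamma = 0.5;     % number of times, ensemble size, obs noise std
rng(1);
ranks = zeros(J, 1);
for j = 1:J
    vtrue = randn;                        % truth drawn from the forecast distribution
    y = vtrue + gamma*randn;              % observation of the truth
    vhat = randn(N, 1);                   % N equally weighted forecast particles
    yens = vhat + gamma*randn(N, 1);      % step 1: add observational noise draws
    yens = sort(yens);                    % step 2: order the particles, giving N+1 bins
    ranks(j) = 1 + sum(yens < y);         % step 3: rank of y, between 1 and N+1
end
counts = accumarray(ranks, 1, [N+1, 1]);  % rank histogram
disp(counts')
```

Because the ensemble here is consistent, the counts should be roughly equal across the N + 1 bins; shrinking or stretching the spread of vhat relative to vtrue would push the histogram towards the edges or the center respectively.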


4.7 Exercises 

1. Consider the Kalman filter in the case where M = H = I, Σ = 0 and Γ > 0. Prove 
that the covariance operator C_j converges to 0 as j → ∞. Modify the program p8.m 
so that it applies to this set-up, in the two dimensional setting. Verify what you have 
proved regarding the covariance and make a conjecture about the limiting behaviour of 
the mean of the Kalman filter. 

2. Consider the 3DVAR algorithm in the case where Ψ(v) = v, H = I, Σ = 0 and Γ > 0. 
Choose \hat{C} = αΓ. Find an equation for the error e_j := v_j − m_j and derive an expression 
for lim_{j→∞} E|e_j|^2 in terms of α and σ^2 := E|η_j|^2. Modify the program p9.m so that 
it applies to this set-up, in the one dimensional setting. Verify what you have proved 
regarding the limiting behaviour of the m_j. 

3. Consider the EnKF algorithm in the same setting as the previous example. Modify 
program p12.m so that it applies to this set-up, in the one dimensional setting. Study 
the behaviour of the sequence m_j found as the mean of the particles v_j^{(n)} over the 
ensemble index n. 

4. Consider the SIRS algorithm in the same setting as the previous example. Modify 
program p14.m so that it applies to this set-up, in the one dimensional setting. Study 
the behaviour of the sequence m_j found as the mean of the particles v_j^{(n)} over the 
ensemble index n. 

5. Make comparative comments regarding the 3DVAR, EnKF and SIRS algorithms, on 
the basis of your solutions to the three preceding exercises. 

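6. In this exercise we study the behaviour of the mean of the Kalman filter in the case of 
one dimensional dynamics. The notation follows the development in subsection 4.4.1. 
Consider the case σ = 0 and assume that the data {y_j}_{j∈N} is generated from a true 
signal {v_j^†}_{j∈Z+} governed by the equation 

v_{j+1}^† = \lambda v_j^†, 

and that the additive observational noise {η_j}_{j∈N} is drawn from an i.i.d. sequence with 
variance γ^2. Define the error e_j = m_j − v_j^† between the estimated mean and the true 
signal and use (4.37b) to show that 

e_{j+1} = \Bigl(1 - \frac{c_{j+1}}{\gamma^2}\Bigr)\lambda e_j + \frac{c_{j+1}}{\gamma^2}\,\eta_{j+1}. (4.50) 

Deduce that e_j is Gaussian and that its mean and covariance satisfy the equations 

\mathbb{E}e_{j+1} = \lambda\Bigl(1 - \frac{c_{j+1}}{\gamma^2}\Bigr)\mathbb{E}e_j, (4.51) 

and 

\mathbb{E}e_{j+1}^2 = \lambda^2\Bigl(1 - \frac{c_{j+1}}{\gamma^2}\Bigr)^2\mathbb{E}e_j^2 + \frac{c_{j+1}^2}{\gamma^2}. (4.52) 

Equation (4.51) can be solved to obtain 

\mathbb{E}e_j = \lambda^j\Bigl[\prod_{i=0}^{j-1}\Bigl(1 - \frac{c_{i+1}}{\gamma^2}\Bigr)\Bigr]\mathbb{E}e_0, (4.53) 

and, in a similar way, obtain for the solution of (4.52): 

\mathbb{E}e_j^2 = \lambda^{2j}\Bigl[\prod_{i=0}^{j-1}\Bigl(1 - \frac{c_{i+1}}{\gamma^2}\Bigr)^2\Bigr]\mathbb{E}e_0^2 + \sum_{i=1}^{j-1}\Bigl[\prod_{k=i+1}^{j}\Bigl(1 - \frac{c_k}{\gamma^2}\Bigr)^2\Bigr]\lambda^{2(j-i)}\frac{c_i^2}{\gamma^2} + \frac{c_j^2}{\gamma^2}. (4.54) 

Using the properties of the variance derived in 4.4.1, prove that the mean of the error 
tends to zero and that the asymptotic variance is bounded by γ^2. 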
Chapter 5 


Discrete Time: MATLAB Programs 


This chapter is dedicated to illustrating the examples, theory and algorithms, as presented 
in the previous chapters, through a few short and easy to follow MATLAB programs. These 
programs are provided for two reasons: (i) for some readers they will form the best route by 
which to appreciate the details of the examples, theory and algorithms we describe; (ii) for 
other readers they will be a useful starting point to develop their own codes: whilst ours are 
not necessarily the optimal implementations of the algorithms discussed in these notes, they 
have been structured to be simple to understand, to modify and to extend. In particular the 
code may be readily extended to solve problems more complex than those described in the 
Examples 2.1-2.7, which we will use for most of our illustrations. The chapter is divided into 
three sections, corresponding to programs relevant to each of the preceding three chapters. 

Before getting into details we highlight a few principles that have been adopted in the 
programs and in accompanying text of this chapter. First, notation is consistent between 
programs, and matches the text in the previous sections of the book as far as possible. 
Second, since many of the elements of the individual programs are repeated, they will be 
described in detail only in the text corresponding to the program in which they first appear; 
the short annotations explaining them will be repeated within the programs however. Third, 
the reader is advised to use the documentation available at the command line for any 
built-in functions of MATLAB; this information can be accessed using the help command: for 
example the documentation for the command help can be accessed by typing help help. 

5.1 Chapter 2 Programs 

The programs p1.m and p2.m used to generate the figures in Chapter 2 are presented in 
this section. Thus these algorithms simply solve the dynamical system (EH), and process the 
resulting data. 

5.1.1. p1.m 

The first program p1.m illustrates how to obtain sample paths from equations EH and (EH- 
In particular the program simulates sample paths of the equation 

v_{j+1} = \alpha \sin(v_j) + \xi_j (5.1) 

with ξ_j ∼ N(0, σ^2) i.i.d. and α = 2.5, both for deterministic (σ = 0) and stochastic 
dynamics (σ ≠ 0) corresponding to Example 2.3. In line 5 the variable J is defined, which 
corresponds to the number of forward steps that we will take. The parameters α and σ are 
set in lines 6-7. The seed for the random number generator is set to sd ∈ N in line 8 using 
the command rng(sd). This guarantees the results will be reproduced exactly by running 
the program with this same sd. Different choices of sd ∈ N will lead to different streams of 
random numbers used in the program, which may also be desirable in order to observe the 
effects of different random numbers on the output. The command rng(sd) will be called in the 
preamble of all of the programs that follow. In line 9, two vectors of length J+1 are created 
named v and vnoise; after running the program, these two vectors contain the solution 
for the case of deterministic (σ = 0) and stochastic dynamics (σ = 0.25) respectively. After 
setting the initial conditions in line 10, the desired map is iterated, without and with noise, in 
lines 12-15. Note that the only difference between the forward iterations of v and vnoise 
is the presence of the sigma*randn term, which corresponds to the generation of a random 
variable sampled from N(0, σ^2). Lines 17-18 contain code which graphs the trajectories, 
with and without noise, to produce Figure 2.3. Figures 2.1, 2.2 and 2.5 were obtained by 
simply modifying lines 12-15 of this program, in order to create sample paths for the 
corresponding Ψ for the other three examples; furthermore, Figure[2j4^ was generated from output 
of this program and Figure [2T4 |d was generated from output of a modification of this program. 


clear;set(0,'defaultaxesfontsize',20);format long
%%% p1.m - behaviour of sin map (Ex. 1.3)
%%% with and without observational noise

J=10000;% number of steps
alpha=2.5;% dynamics determined by alpha
sigma=0.25;% dynamics noise variance is sigma^2
sd=1;rng(sd);% choose random number seed
v=zeros(J+1,1); vnoise=zeros(J+1,1);% preallocate space
v(1)=1;vnoise(1)=1;% initial conditions

for i=1:J
    v(i+1)=alpha*sin(v(i));
    vnoise(i+1)=alpha*sin(vnoise(i))+sigma*randn;
end

figure(1), plot([0:1:J],v),
figure(2), plot([0:1:J],vnoise),
5.1.2. p2.m 


The second program presented here, p2.m, is designed to visualize the posterior distribution 
in the case of one dimensional deterministic dynamics. For clarity, the program is separated 
into three main sections. The setup section in lines 5-10 defines the parameters of the 
problem. The model parameter r is defined in line 6, and determines the dynamics of the 
forward model, in this case given by the logistic map (2.9): 

v_{j+1} = r v_j (1 - v_j). (5.2) 

The dynamics are taken as deterministic, so the parameter sigma does not feature here. 
The parameter r = 2 so that the dynamics are not chaotic, as the explicit solution given in 
Example 2.4 shows. The parameters m0 and C0 define the mean and covariance of the prior 
distribution v_0 ∼ N(m_0, C_0), whilst gamma defines the observational noise η_j ∼ N(0, γ^2). 

The truth section in lines 14-20 generates the true reference trajectory (or, truth) vt in 
line 18 given by (5.2), as well as the observations y in line 19 given by 

y_j = v_j + \eta_j. (5.3) 

Note that the index of y(:,j) corresponds to observation of H*v(:,j+1). This is due to 
the fact that the first index of an array in MATLAB is j = 1, while the initial condition is v_0, 
and the first observation is of v_1. So, effectively the indices of y are correct as corresponding 
to the text and equation (5.3), but the indices of v are one off. The memory for these vectors 
is preallocated in line 14. This is not necessary because MATLAB would simply dynamically 
allocate the memory in its absence, but it would slow down the computations due to the 
necessity of allocating new memory each time the given array changes size. Commenting 
this line out allows observation of this effect, which becomes significant when J becomes 
sufficiently large. 
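The effect of preallocation can be timed directly with tic and toc. The following is a small sketch, separate from p2.m, which iterates the logistic map twice: once into a preallocated vector and once into an array that grows at every step. The value of J and the variable names tPre, tGrow and wt are illustrative, and the size of the gap will depend on the MATLAB version and machine.

```matlab
% Sketch only: J and the variable names are illustrative assumptions.
J = 1e5;                          % number of steps; larger J makes the gap clearer
r = 2;
tic
vt = zeros(J+1, 1);               % preallocated, as in line 14 of p2.m
vt(1) = 0.1;
for j = 1:J
    vt(j+1) = r*vt(j)*(1 - vt(j));
end
tPre = toc;
tic
wt = 0.1;                         % no preallocation: the array grows at every step
for j = 1:J
    wt(j+1) = r*wt(j)*(1 - wt(j));
end
tGrow = toc;
fprintf('preallocated: %.4f s, growing: %.4f s\n', tPre, tGrow);
```

Both loops produce the same trajectory; only the memory handling differs.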

The solution section after line 24 computes the solution, in this case the point-wise 
representation of the posterior smoothing distribution on the scalar initial condition. The 
point-wise values of the initial condition are given by the vector v0 (v_0) defined in line 24. 
There are many ways to construct such vectors; this convention defines the initial (0.01) 
and final (0.99) values and a uniform step size 0.0005. It is also possible to use the command 
v0=linspace(0.01,0.99,1961), defining the number 1961 of points, rather 
than the stepsize 0.0005. The corresponding vectors of values of Phidet (Φ_det), Jdet (J_det), 
and Idet (I_det) are computed in lines 32, 29, and 34 for each value of v0, as related by the 
equation 

I_det(v_0; y) = J_det(v_0) + \Phi_det(v_0; y), (5.4) 

where J_det(v_0) is the background penalization and Φ_det(v_0; y) is the model-data misfit 
functional given by (I2.29b l and (|2.29b l respectively. The function I_det(v_0; y) is the negative 
log-posterior as given in Theorem 2.11. Having obtained I_det(v_0; y) we calculate P(v_0|y) in 
lines 37-38, using the formula 

P(v_0|y) = \frac{\exp(-I_det(v_0; y))}{\int \exp(-I_det(v_0; y))}. (5.5) 

The trajectory v corresponding to the given value of v_0 (v0(j)) is denoted by vv and is 
replaced for each new value of v0(j) in lines 29 and 31 since it is only required to compute 
Idet. The command trapz(v0,exp(-Idet)) in line 37 approximates the denominator 
of the above by the trapezoidal rule, i.e. the summation 

trapz(v0, exp(-Idet)) = \sum_{i=1}^{N-1} (v0(i+1) - v0(i)) * (exp(-Idet(i+1)) + exp(-Idet(i)))/2. (5.6) 
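The trapezoidal sum can be checked against trapz on a small example. In the sketch below the quadratic Idet is a hypothetical stand-in for the negative log-posterior computed in p2.m, and the names f and byHand are introduced only for illustration.

```matlab
% Sketch only: the quadratic Idet is a hypothetical stand-in, not the p2.m posterior.
v0 = 0.01:0.0005:0.99;                  % same grid as in p2.m
Idet = 50*(v0 - 0.5).^2;                % stand-in for the negative log-posterior
f = exp(-Idet);
byHand = sum((v0(2:end) - v0(1:end-1)) .* (f(2:end) + f(1:end-1))/2);
constant = trapz(v0, f);                % normalizing constant, as in line 37
P = f/constant;                         % normalized density, as in line 38
fprintf('by hand %.12f, trapz %.12f\n', byHand, constant);
```

After normalization the trapezoidal integral of P over the grid equals one up to rounding, which is a useful sanity check when modifying the program.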
The rest of the program deals with plotting our results and in this instance it coincides with 
the output of Figure 2.11b. Again simple modifications of this program were used to produce 
Figures 2.10, 2.12 and 2.13. Note that rng(sd) in line 10 allows us to use the same random 
numbers every time the file is executed; those random numbers are generated with the seed 
sd as described in the previous section 5.1.1. Commenting this line out would result in the 
creation of new realizations of the random data y, different from the ones used to obtain 
Figure 2.11b. 


1  clear; set(0,'defaultaxesfontsize',20); format long
2  %%% p2.m smoothing problem for the deterministic logistic map (Ex. 1.4)
3  %% setup
4
5  J=1000;% number of steps
6  r=2;% dynamics determined by r
7  gamma=0.1;% observational noise variance is gamma^2
8  C0=0.01;% prior initial condition variance
9  m0=0.7;% prior initial condition mean
10 sd=1;rng(sd);% choose random number seed
11
12 %% truth
13
14 vt=zeros(J+1,1); y=zeros(J,1);% preallocate space to save time
15 vt(1)=0.1;% truth initial condition
16 for j=1:J
17     % can be replaced by Psi for each problem
18     vt(j+1)=r*vt(j)*(1-vt(j));% create truth
19     y(j)=vt(j+1)+gamma*randn;% create data
20 end
21
22 %% solution
23
24 v0=[0.01:0.0005:0.99];% construct vector of different initial data
25 Phidet=zeros(length(v0),1);Idet=Phidet;Jdet=Phidet;% preallocate space
26 vv=zeros(J+1,1);% preallocate space
27 % loop through initial conditions vv0, and compute log posterior I0(vv0)
28 for j=1:length(v0)
29     vv(1)=v0(j); Jdet(j)=1/2/C0*(v0(j)-m0)^2;% background penalization
30     for i=1:J
31         vv(i+1)=r*vv(i)*(1-vv(i));
32         Phidet(j)=Phidet(j)+1/2/gamma^2*(y(i)-vv(i+1))^2;% misfit
33     end
34     Idet(j)=Phidet(j)+Jdet(j);
35 end
36
37 constant=trapz(v0,exp(-Idet));% approximate normalizing constant
38 P=exp(-Idet)/constant;% normalize posterior distribution
39 prior=normpdf(v0,m0,sqrt(C0));% calculate prior distribution
40
41 figure(1),plot(v0,prior,'k','LineWidth',2)
42 hold on, plot(v0,P,'r--','LineWidth',2), xlabel 'v_0',
43 legend('prior','J=10^3')
5.2 Chapter 3 Programs 

The programs p3.m-p7.m, used to generate the figures in Chapter 3, are presented in this 
section. Hence various MCMC algorithms used to sample the posterior smoothing distribution 
are given. Furthermore, optimization algorithms used to obtain solutions of the 4DVAR and 
w4DVAR variational methods are also introduced. Our general theoretical development of 
MCMC methods in section 3.2 employs a notation of u for the state of the chain and w for 
the proposal. For deterministic dynamics the state is the initial condition v_0; for stochastic 
dynamics it is either the signal v or the pair (v_0, ξ) where ξ is the noise (since this pair 
determines the signal). Where appropriate, the programs described here use the letter v, and 
variants on it, for the state of the Markov chain, to keep the connection with the underlying 
dynamics model. 

5.2.1. p3.m 

The program p3.m contains an implementation of the Random Walk Metropolis (RWM) 
MCMC algorithm. The development follows section 3.2.3 where the algorithm is used to 
determine the posterior distribution on the initial condition arising from the deterministic 
logistic map of Example 2.4 given by (5.2). Note that in this case, since the underlying 
dynamics are deterministic and hence completely determined by the initial condition, the 
RWM algorithm will provide samples from a probability distribution on R. 

As in program p2.m, the code is divided into 3 sections: setup where parameters are 
defined, truth where the truth and data are generated, and solution where the solution is 
computed, this time by means of MCMC samples from the posterior smoothing distribution. 
The parameters in lines 5-10 and the true solution (here taken as only the initial condition, 
rather than the trajectory it gives rise to) vt in line 14 are taken to be the same as those 
used to generate Figure 3.13. The temporary vector vv generated in line 19 is the trajectory 
corresponding to the truth (vv(1)=vt in line 14), and used to calculate the observations 
y in line 20. The true value vt will also be used as the initial sample in the Markov chain 
for this and for all subsequent MCMC programs. This scenario is, of course, not possible 
in the case that the data is not simulated. However it is useful in the case that the data is 
simulated, as it is here, because it can reduce the burn-in time, i.e. the time necessary for the 
current sample in the chain to reach the target distribution, or the high-probability region 
of the state-space. Because we initialize the Markov chain at the truth, the value of I_det(v_0^†), 
denoted by the temporary variable Idet, is required to determine the initial acceptance 
probability, as described below. It is computed in lines 15-23 exactly as in lines 25-34 of 
program p2.m, as described around equation (5.4). 

In the solution section some additional MCMC parameters are defined. In line 28 the
number of samples is set to $N = 10^5$. For the parameters and specific data used here, this
is sufficient for convergence of the Markov chain. In line 30 the step-size parameter beta
is pre-set such that the algorithm for this particular posterior distribution has a reasonable
acceptance probability, i.e. fraction of proposed moves that are accepted. A general rule of
thumb is that it should be somewhere around 0.5, so that the chain is neither too correlated
because of a high rejection rate (acceptance probability near zero) nor too correlated because
of small moves (acceptance probability near one). The vector V defined in line 29 will save
all of the samples. This is an example where preallocation is very important. Try using the
commands tic and toc before and after, respectively, the loop in lines 33-50 in order to time
the chain both with and without preallocation. (In practice, one may often choose to collect
certain statistics from the chain "on-the-fly" rather than saving every sample, particularly
if the state-space is high-dimensional, where the memory required for each sample is large.)
In line 34 a move is proposed according to the proposal equation (3.15):

\[
w^{(k)} = v^{(k-1)} + \sqrt{2\beta}\,\iota^{(k-1)},
\]

where $v^{(k-1)}$ (denoted v in the code) is the current state of the chain (initially taken
to be equal to the true initial condition $v_0$), $\iota^{(k-1)}$ = randn is an i.i.d. standard
normal, and w represents $w^{(k)}$. Indices are not used for v and w because they will be
overwritten at each iteration.

The temporary variable vv is again used for the trajectory corresponding to $w^{(k)}$, as a
vehicle to compute the value of the proposed $I_{\rm det}(w^{(k)}; y)$, denoted in line 42 by
Idetprop = Jdetprop + Phidetprop. In lines 44-46 the decision to accept or reject the proposal
is made based on the acceptance probability

\[
a(v^{(k-1)}, w^{(k)}) = 1 \wedge \exp\bigl(I_{\rm det}(v^{(k-1)}; y) - I_{\rm det}(w^{(k)}; y)\bigr).
\]

In practice this corresponds to drawing a uniform random number rand and replacing v and
Idet in line 45 with w and Idetprop if rand<exp(Idet-Idetprop) in line 44. The variable
bb is incremented if the proposal is accepted, so that the running ratio of accepted moves
bb to total steps n can be computed in line 47. This approximates the average acceptance
probability. The current sample $v^{(k)}$ is stored in line 48. Notice that here one could
replace v by V(n-1) in line 34, and by V(n) in line 45, thereby eliminating v and line 48, and
letting w be the only temporary variable. However, the present construction is favourable
because, as mentioned above, in general one may not wish to save every sample.

The samples V are used in lines 51-53 in order to visualize the posterior distribution. In
particular, bins of width dx are defined in line 51, and the command hist is used in line 52.
The assignment Z = hist(V,v0) means the following. First, the real line is split into M bins
with centers defined according to v0(i) for i = 1,...,M, with the first and last bins
extending to the negative and positive half-lines, respectively. Second, Z(i) counts the
number of k for which V(k) is in the bin with center v0(i). Again, trapz is used to compute
the normalizing constant in line 53, directly within the plotting command. The choice of the
location of the histogram bins allows for a direct comparison with the posterior distribution
calculated from the program p2.m by directly evaluating $I_{\rm det}(v; y)$, defined in (5.4),
for different values of the initial condition v. This output is then compared with the
corresponding output of p2.m for the same parameters in Figure 3.2.
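The dependence of the acceptance rate on the step size can be explored with a small
experiment. The following is a minimal sketch and not part of p3.m: it assumes a toy standard
normal target, with $I(v) = v^2/2$ in place of the logistic-map posterior, and the three
values of beta are illustrative choices.

```matlab
% Sketch: acceptance rate of RWM for several step sizes.
% Assumption: toy target N(0,1), i.e. I(v)=v^2/2, NOT the posterior of p3.m.
N = 2e4;                          % number of samples per chain
betas = [1e-3, 0.5, 50];          % illustrative step sizes (hypothetical values)
acc = zeros(size(betas));         % preallocate acceptance rates
for b = 1:numel(betas)
    beta = betas(b);
    v = 0; Iv = v^2/2;            % start the chain at the mode
    bb = 0;                       % number of accepted moves
    for n = 1:N
        w = v + sqrt(2*beta)*randn;   % random walk proposal
        Iw = w^2/2;
        if rand < exp(Iv - Iw)        % accept with probability 1 ^ exp(I(v)-I(w))
            v = w; Iv = Iw; bb = bb + 1;
        end
    end
    acc(b) = bb/N;                % fraction of accepted proposals
end
disp([betas; acc])                % small beta: rate near 1; large beta: rate near 0
```

Very small steps are almost always accepted but barely move the chain, while very large
steps are mostly rejected; both extremes give highly correlated samples.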
clear; set(0,'defaultaxesfontsize',20); format long
%%% p3.m MCMC RWM algorithm for logistic map (Ex. 1.4)
%% setup

J=5;% number of steps
r=4;% dynamics determined by r
gamma=0.2;% observational noise variance is gamma^2
C0=0.01;% prior initial condition variance
m0=0.5;% prior initial condition mean
sd=10;rng(sd);% choose random number seed

%% truth

vt=0.3;vv(1)=vt;% truth initial condition
Jdet=1/2/C0*(vt-m0)^2;% background penalization
Phidet=0;% initialization model-data misfit functional
for j=1:J
    % can be replaced by Psi for each problem
    vv(j+1)=r*vv(j)*(1-vv(j));% create truth
    y(j)=vv(j+1)+gamma*randn;% create data
    Phidet=Phidet+1/2/gamma^2*(y(j)-vv(j+1))^2;% misfit functional
end
Idet=Jdet+Phidet;% compute log posterior of the truth

%% solution
% Markov Chain Monte Carlo: N forward steps of the
% Markov Chain on R (with truth initial condition)
N=1e5;% number of samples
V=zeros(N,1);% preallocate space to save time
beta=0.05;% step-size of random walker
v=vt;% truth initial condition (or else update Idet)
n=1; bb=0; rat(1)=0;
while n<=N
    w=v+sqrt(2*beta)*randn;% propose sample from random walker
    vv(1)=w;
    Jdetprop=1/2/C0*(w-m0)^2;% background penalization
    Phidetprop=0;
    for i=1:J
        vv(i+1)=r*vv(i)*(1-vv(i));
        Phidetprop=Phidetprop+1/2/gamma^2*(y(i)-vv(i+1))^2;
    end
    Idetprop=Jdetprop+Phidetprop;% compute log posterior of the proposal

    if rand<exp(Idet-Idetprop)% accept or reject proposed sample
        v=w; Idet=Idetprop; bb=bb+1;% update the Markov chain
    end
    rat(n)=bb/n;% running rate of acceptance
    V(n)=v;% store the chain
    n=n+1;
end
dx=0.0005; v0=[0.01:dx:0.99];
Z=hist(V,v0);% construct the posterior histogram
figure(1), plot(v0,Z/trapz(v0,Z),'k','Linewidth',2)% visualize posterior
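
The histogram normalization used at the end of p3.m can be tried in isolation. The
following is a minimal sketch under stated assumptions: the samples V are uniform stand-ins
rather than MCMC output, and the bin width is coarser than in the program.

```matlab
% Sketch: turn samples into a normalized density estimate with hist and trapz.
% Assumption: V holds stand-in samples (uniform on [0.2,0.8]), not MCMC output.
V = 0.2 + 0.6*rand(1e4,1);        % hypothetical samples
dx = 0.01; v0 = 0.01:dx:0.99;     % bin centers (coarser dx than in p3.m)
Z = hist(V, v0);                  % Z(i) counts samples nearest to center v0(i)
p = Z/trapz(v0, Z);               % normalize so that the estimate integrates to one
fprintf('counts sum to %d, integral of p is %.6f\n', sum(Z), trapz(v0, p));
```

Dividing by trapz(v0,Z) rather than by the number of samples makes the curve directly
comparable with a density evaluated on the same grid.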
5.2.2. p4.m

The program p4.m contains an implementation of the Independence Dynamics Sampler for
stochastic dynamics, as introduced in subsection 3.2.4. Thus the posterior distribution is on
the entire signal $\{v_j\}_{j=0}^{J}$. The forward model in this case is from Example 2.3,
given by (5.1). The smoothing distribution $\mathbb{P}(v|y)$ is therefore over the state-space
$\mathbb{R}^{J+1}$.

The sections setup, truth, and solution are defined as for program p3.m, but note
that now the smoothing distribution is over the entire path, not just over the initial
condition, because we are considering stochastic dynamics. Since the state-space is now the
path-space, rather than the initial condition as it was in program p3.m, the truth
vt $\in \mathbb{R}^{J+1}$ is now a vector. Its initial condition is taken as a draw from
$N(m_0, C_0)$ in line 16, and the trajectory is computed in line 20, so that at the end
vt $\sim \rho_0$. As in program p3.m, $v^{\dagger}$ (vt) will be the chosen initial condition
in the Markov chain (to ameliorate burn-in issues) and so $\Phi(v^{\dagger}; y)$ is computed in
line 23. Recall from subsection 3.2.4 that only $\Phi(\cdot; y)$ is required to compute the
acceptance probability in this algorithm.

Notice that the collection of samples V $\in \mathbb{R}^{N \times (J+1)}$ preallocated in
line 30 is substantial in this case, illustrating the memory issue which arises when the
dimension of the signal space, and the number of samples, increase.

The current state of the chain $v^{(k)}$ and the value of $\Phi(v^{(k)}; y)$ are again
denoted v and Phi, while the proposal $w^{(k)}$ and the value of $\Phi(w^{(k)}; y)$ are again
denoted w and Phiprop, as in program p3.m. As discussed in section 3.2.4, the proposal
$w^{(k)}$ is an independent sample from the prior distribution $\rho_0$, similarly to
$v^{\dagger}$, and it is constructed in lines 34-39. The acceptance probability used in
line 40 is now

\[
a(v^{(k-1)}, w^{(k)}) = 1 \wedge \exp\bigl(\Phi(v^{(k-1)}; y) - \Phi(w^{(k)}; y)\bigr). \tag{5.7}
\]

The remainder of the program is structurally the same as p3.m. The outputs of this
program are used to plot Figures 3.3, 3.4 and 3.5. Note that in the case of Figure [H31 we
have used N = 10® samples.
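
The reason why only $\Phi$ enters (5.7) can be made explicit. The following worked step is a
sketch under the assumption that the posterior density is proportional to
$\rho_0(v)\exp(-\Phi(v;y))$ and that proposals are drawn independently from $\rho_0$; the
symbol $q$ for the proposal density is introduced here only for illustration.

```latex
% Metropolis-Hastings ratio for the independence proposal q(v,w) = \rho_0(w)
\[
\frac{\rho_0(w)\,e^{-\Phi(w;y)}\,q(w,v)}{\rho_0(v)\,e^{-\Phi(v;y)}\,q(v,w)}
=\frac{\rho_0(w)\,e^{-\Phi(w;y)}\,\rho_0(v)}{\rho_0(v)\,e^{-\Phi(v;y)}\,\rho_0(w)}
=\exp\bigl(\Phi(v;y)-\Phi(w;y)\bigr),
\]
\[
a(v,w)=1\wedge\exp\bigl(\Phi(v;y)-\Phi(w;y)\bigr).
\]
```

The prior densities cancel, so the prior never needs to be evaluated, only sampled.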
1 clear; set(0,'defaultaxesfontsize',20); format long
2 %%% p4.m MCMC INDEPENDENCE DYNAMICS SAMPLER algorithm
3 %%% for sin map (Ex. 1.3) with noise
4 %% setup
5
6 J=10;% number of steps
7 alpha=2.5;% dynamics determined by alpha
8 gamma=1;% observational noise variance is gamma^2
9 sigma=1;% dynamics noise variance is sigma^2
10 C0=1;% prior initial condition variance
11 m0=0;% prior initial condition mean
12 sd=0;rng(sd);% choose random number seed
13
14 %% truth
15
16 vt(1)=m0+sqrt(C0)*randn;% truth initial condition
17 Phi=0;
18
19 for j=1:J
20     vt(j+1)=alpha*sin(vt(j))+sigma*randn;% create truth
21     y(j)=vt(j+1)+gamma*randn;% create data
22     % calculate log likelihood of truth, Phi(v;y) from (1.11)
23     Phi=Phi+1/2/gamma^2*(y(j)-vt(j+1))^2;
24 end
25
26 %% solution
27 % Markov Chain Monte Carlo: N forward steps of the
28 % Markov Chain on R^{J+1} with truth initial condition
29 N=1e5;% number of samples
30 V=zeros(N,J+1);% preallocate space to save time
31 v=vt;% truth initial condition (or else update Phi)
32 n=1; bb=0; rat(1)=0;
33 while n<=N
34     w(1)=sqrt(C0)*randn;% propose sample from the prior
35     Phiprop=0;
36     for j=1:J
37         w(j+1)=alpha*sin(w(j))+sigma*randn;% propose sample from the prior
38         Phiprop=Phiprop+1/2/gamma^2*(y(j)-w(j+1))^2;% compute likelihood
39     end
40     if rand<exp(Phi-Phiprop)% accept or reject proposed sample
41         v=w; Phi=Phiprop; bb=bb+1;% update the Markov chain
42     end
43     rat(n)=bb/n;% running rate of acceptance
44     V(n,:)=v;% store the chain
45     n=n+1;
46 end
47 % plot acceptance ratio and cumulative sample mean
48 figure;plot(rat)
49 figure;plot(cumsum(V(1:N,1))./[1:N]')
50 xlabel('samples N')
51 ylabel('(1/N) \Sigma_{n=1}^N v_0^{(n)}')
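
When only a few statistics of the chain are needed, the memory cost of storing every path
can be avoided. The following is a minimal sketch of an on-the-fly sample mean; it assumes
i.i.d. Gaussian draws as a stand-in for the states of a chain, and the names vbar and d are
illustrative.

```matlab
% Sketch: accumulate the sample mean on-the-fly instead of storing the chain.
% Assumption: i.i.d. draws from N(1,4) stand in for the states v of a chain.
N = 1e5; d = 11;                  % number of samples; d plays the role of J+1
vbar = zeros(1,d);                % running mean, O(d) memory instead of O(N*d)
for n = 1:N
    v = 1 + 2*randn(1,d);         % hypothetical "current state" of the chain
    vbar = vbar + (v - vbar)/n;   % update: mean of the first n states
end
disp(max(abs(vbar - 1)))          % close to zero, up to Monte Carlo error
```

The same recursion applies to any quantity that is an average over the chain; the price is
that the individual samples are no longer available for histograms.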
5.2.3. p5.m

The Independence Dynamics Sampler of subsection 5.2.2 may be very inefficient, as typical
random draws from the dynamics may be unlikely to fit the data as well as the current
state, and will then be rejected. The fifth program p5.m gives an implementation of the
pCN algorithm from section 3.2.4, which is designed to overcome this issue by including the
parameter $\beta$ which, if chosen small, allows for incremental steps in signal space and
hence the possibility of non-negligible acceptance probabilities. This program is used to generate
Figure EH 

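This program is almost identical to p4.m, and so only the points at which it differs will
be described. First, since the acceptance probability is given by

\[
a(v^{(k-1)}, w^{(k)}) = 1 \wedge \exp\bigl(\Phi(v^{(k-1)}; y) - \Phi(w^{(k)}; y)
 + G(v^{(k-1)}) - G(w^{(k)})\bigr),
\]

the quantity

\[
G(v) = \sum_{j=0}^{J-1}\Bigl(\tfrac12\bigl|\Sigma^{-1/2}\Psi(v_j)\bigr|^2
 - \bigl\langle \Sigma^{-1/2} v_{j+1}, \Sigma^{-1/2}\Psi(v_j)\bigr\rangle\Bigr)
\]

will need to be computed, both for $v^{(k)}$ (denoted by v in lines 30 and 43), where its
value is denoted by G ($v^{(0)} = v^{\dagger}$ and $G(v^{\dagger})$ is computed in line 21),
and for $w^{(k)}$ (denoted by w in line 35), where its value is denoted by Gprop in lines
38-39.

As discussed in section 3.2.4, the proposal is given by (3.19):

\[
w^{(k)} = m + (1-\beta^2)^{1/2}\bigl(v^{(k-1)} - m\bigr) + \beta\,\iota^{(k-1)}, \tag{5.8}
\]

where $\iota^{(k-1)} \sim N(0, C)$ are i.i.d. and denoted by iota in line 34. Here C is the
covariance of the Gaussian measure $\pi_0$ given in Equation (2.24), corresponding to the case
of trivial dynamics $\Psi = 0$, and m is the mean of $\pi_0$. The value of m is given by m in
line 32.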
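
The origin of $G$ can be traced to the change of measure between the prior $\rho_0$ and the
Gaussian reference measure $\pi_0$. The following worked step is a sketch, assuming Lebesgue
densities and the notation $C_0$, $\Sigma$, $\Psi$ used in the formula for $G$ above.

```latex
% Densities of the prior (with dynamics Psi) and of the reference measure (Psi = 0)
\[
\rho_0(v)\propto\exp\Bigl(-\tfrac12\bigl|C_0^{-1/2}(v_0-m_0)\bigr|^2
 -\sum_{j=0}^{J-1}\tfrac12\bigl|\Sigma^{-1/2}\bigl(v_{j+1}-\Psi(v_j)\bigr)\bigr|^2\Bigr),
\qquad
\pi_0(v)\propto\exp\Bigl(-\tfrac12\bigl|C_0^{-1/2}(v_0-m_0)\bigr|^2
 -\sum_{j=0}^{J-1}\tfrac12\bigl|\Sigma^{-1/2}v_{j+1}\bigr|^2\Bigr).
\]
% Expanding the square in rho_0 term by term
\[
\tfrac12\bigl|\Sigma^{-1/2}(v_{j+1}-\Psi(v_j))\bigr|^2
=\tfrac12\bigl|\Sigma^{-1/2}v_{j+1}\bigr|^2
-\bigl\langle\Sigma^{-1/2}v_{j+1},\Sigma^{-1/2}\Psi(v_j)\bigr\rangle
+\tfrac12\bigl|\Sigma^{-1/2}\Psi(v_j)\bigr|^2 ,
\]
\[
\text{so that}\qquad \frac{\rho_0(v)}{\pi_0(v)}\propto\exp\bigl(-G(v)\bigr),
\qquad
\mathbb{P}(v\,|\,y)\propto\exp\bigl(-\Phi(v;y)-G(v)\bigr)\,\pi_0(v).
\]
```

Since the pCN proposal is built around $\pi_0$ rather than $\rho_0$, both $\Phi$ and $G$
appear in the acceptance probability.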
1 clear;set(0,'defaultaxesfontsize',20);format long
2 %%% p5.m MCMC pCN algorithm for sin map (Ex. 1.3) with noise
3
4 %% setup
5 J=10;% number of steps
6 alpha=2.5;% dynamics determined by alpha
7 gamma=1;% observational noise variance is gamma^2
8 sigma=.1;% dynamics noise variance is sigma^2
9 C0=1;% prior initial condition variance
10 m0=0;% prior initial condition mean
11 sd=0;rng(sd);% Choose random number seed
12
13 %% truth
14 vt(1)=m0+sqrt(C0)*randn;% truth initial condition
15 G=0;Phi=0;
16
17 for j=1:J
18     vt(j+1)=alpha*sin(vt(j))+sigma*randn;% create truth
19     y(j)=vt(j+1)+gamma*randn;% create data
20     % calculate log density from (1.--)
21     G=G+1/2/sigma^2*((alpha*sin(vt(j)))^2-2*vt(j+1)*alpha*sin(vt(j)));
22     % calculate log likelihood phi(u;y) from (1.11)
23     Phi=Phi+1/2/gamma^2*(y(j)-vt(j+1))^2;
24 end
25
26 %% solution
27 % Markov Chain Monte Carlo: N forward steps
28 N=1e5;% number of samples
29 beta=0.02;% step-size of pCN walker
30 v=vt;% truth initial condition (or update G + Phi)
31 V=zeros(N,J+1); n=1; bb=0; rat=0;
32 m=[m0,zeros(1,J)];
33 while n<=N
34     iota=[sqrt(C0)*randn,sigma*randn(1,J)];% Gaussian prior sample
35     w=m+sqrt(1-beta^2)*(v-m)+beta*iota;% propose pCN sample
36     Gprop=0;Phiprop=0;
37     for j=1:J
38         Gprop=...
39             Gprop+1/2/sigma^2*((alpha*sin(w(j)))^2-2*w(j+1)*alpha*sin(w(j)));
40         Phiprop=Phiprop+1/2/gamma^2*(y(j)-w(j+1))^2;
41     end
42     if rand<exp(Phi-Phiprop+G-Gprop)% accept or reject proposed sample
43         v=w;Phi=Phiprop;G=Gprop;bb=bb+1;% update the Markov chain
44     end
45     rat(n)=bb/n;% running rate of acceptance
46     V(n,:)=v;% store the chain
47     n=n+1;
48 end
49 % plot acceptance ratio and cumulative sample mean
50 figure;plot(rat)
51 figure;plot(cumsum(V(1:N,1))./[1:N]')
52 xlabel('samples N')
53 ylabel('(1/N) \Sigma_{n=1}^N v_0^{(n)}')
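
A property worth checking numerically is that the pCN proposal leaves the Gaussian
reference measure invariant for every step size. The following is a minimal sketch under
stated assumptions: a scalar Gaussian N(m,C) with hypothetical values of m, C and beta
replaces the path-space measure of p5.m.

```matlab
% Sketch: the pCN proposal maps draws from N(m,C) to draws from N(m,C).
% Assumption: scalar reference measure with hypothetical m, C and beta.
m = 0.5; C = 2; beta = 0.2;
N = 2e5;
v = m + sqrt(C)*randn(N,1);               % N draws from N(m,C)
iota = sqrt(C)*randn(N,1);                % independent N(0,C) draws
w = m + sqrt(1-beta^2)*(v-m) + beta*iota; % pCN proposal, same form as in p5.m
fprintf('mean %.3f (target %.3f), variance %.3f (target %.3f)\n', ...
        mean(w), m, var(w), C);
```

The variance of w is $(1-\beta^2)C + \beta^2 C = C$, which is why the Gaussian part of the
target does not appear in the accept-reject step.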
5.2.4. p6.m

The pCN dynamics sampler is now introduced as program p6.m. The Independence Dynamics
Sampler of subsection 5.2.2 may be viewed as a special case of this algorithm for proposal
variance $\beta = 1$. This proposal combines the benefits of tuning the step size $\beta$,
while still respecting the prior distribution on the dynamics. It does so by sampling the
initial condition and noise $(v_0, \xi)$ rather than the path itself, in lines 33 and 34, as
given by Equation (5.8). However, as opposed to the pCN sampler of the previous section, this
variable w is now interpreted as a sample of $(v_0, \xi)$ and is therefore fed into the path
vv itself in line 38. The acceptance probability is the same as that of the Independence
Dynamics Sampler (5.7), depending only on Phi. If the proposal is accepted, both the forcing
u=w and the path v=vv are updated in line 43. Only the path is saved, as in the previous
routines, in line 46.
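
The two stages of the proposal, a pCN step on the initial condition and noise followed by a
push-forward through the dynamics, can be sketched in isolation. This is a minimal
illustration, assuming the parameter values of the setup section of p6.m; the current
forcing u is an arbitrary stand-in rather than a state of a real chain, and w1 is a name
introduced here.

```matlab
% Sketch: pCN on (initial condition, noise), then push forward to a path.
% Assumptions: parameter values as in the setup of p6.m; the current forcing u
% is an arbitrary stand-in rather than a state of a real chain.
J = 10; alpha = 2.5; sigma = 1; C0 = 1; m0 = 0; beta = 0.2;
m = [m0, zeros(1,J)];                          % mean of (v_0, xi)
u = [m0 + sqrt(C0)*randn, sigma*randn(1,J)];   % current (v_0, xi)
iota = [sqrt(C0)*randn, sigma*randn(1,J)];     % fresh Gaussian draw
w = m + sqrt(1-beta^2)*(u-m) + beta*iota;      % proposed (v_0, xi)
vv = zeros(1,J+1); vv(1) = w(1);               % proposed path starts at w(1)
for j = 1:J
    vv(j+1) = alpha*sin(vv(j)) + w(j+1);       % dynamics driven by proposed noise
end
% With beta = 1 the proposal forgets u entirely: w1 = m + iota.
w1 = m + sqrt(1-1^2)*(u-m) + 1*iota;
disp(max(abs(w1 - (m + iota))))                % prints 0
```

With beta = 1 the proposed forcing is a fresh draw from the prior, which recovers the
Independence Dynamics Sampler.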
1 clear;set(0,'defaultaxesfontsize',20);format long
2 %%% p6.m MCMC pCN Dynamics algorithm for
3 %%% sin map (Ex. 1.3) with noise
4 %% setup
5
6 J=10;% number of steps
7 alpha=2.5;% dynamics determined by alpha
8 gamma=1;% observational noise variance is gamma^2
9 sigma=1;% dynamics noise variance is sigma^2
10 C0=1;% prior initial condition variance
11 m0=0;% prior initial condition mean
12 sd=0;rng(sd);% Choose random number seed
13
14 %% truth
15 vt(1)=m0+sqrt(C0)*randn;% truth initial condition
16 ut(1)=vt(1);
17 Phi=0;
18 for j=1:J
19     ut(j+1)=sigma*randn;
20     vt(j+1)=alpha*sin(vt(j))+ut(j+1);% create truth
21     y(j)=vt(j+1)+gamma*randn;% create data
22     % calculate log likelihood phi(u;y) from (1.11)
23     Phi=Phi+1/2/gamma^2*(y(j)-vt(j+1))^2;
24 end
25
26 %% solution
27 % Markov Chain Monte Carlo: N forward steps
28 N=1e5;% number of samples
29 beta=0.2;% step-size of pCN walker
30 u=ut;v=vt;% truth initial condition (or update Phi)
31 V=zeros(N,J+1); n=1; bb=0; rat=0;m=[m0,zeros(1,J)];
32 while n<=N
33     iota=[sqrt(C0)*randn,sigma*randn(1,J)];% Gaussian prior sample
34     w=m+sqrt(1-beta^2)*(u-m)+beta*iota;% propose pCN sample
35     vv(1)=w(1);
36     Phiprop=0;
37     for j=1:J
38         vv(j+1)=alpha*sin(vv(j))+w(j+1);% create path
39         Phiprop=Phiprop+1/2/gamma^2*(y(j)-vv(j+1))^2;
40     end
41
42     if rand<exp(Phi-Phiprop)% accept or reject proposed sample
43         u=w;v=vv;Phi=Phiprop;bb=bb+1;% update the Markov chain
44     end
45     rat(n)=bb/n;% running rate of acceptance
46     V(n,:)=v;% store the chain
47     n=n+1;
48 end
49 % plot acceptance ratio and cumulative sample mean
50 figure;plot(rat)
51 figure;plot(cumsum(V(1:N,1))./[1:N]')
52 xlabel('samples N')
53 ylabel('(1/N) \Sigma_{n=1}^N v_0^{(n)}')
5.2.5. p7.m

The next program p7.m contains an implementation of the weak constraint variational
algorithm w4DVAR discussed in section 3.3. This program is written as a function, whilst all
previous programs were written as scripts. This choice was made for p7.m so that the
MATLAB built-in function fminsearch can be used for optimization in the solution section,
and the program can still be self-contained. To use this built-in function it is necessary to
define an auxiliary objective function I to be optimized. The function fminsearch can be
used within a script, but the auxiliary function would then have to be written separately, so
we cannot avoid functions altogether unless we write the optimization algorithm by hand.
We avoid the latter in order not to divert the focus of this text from the data assimilation
problem, and algorithms to solve it, to the problem of how to optimize an objective function.

Again the forward model is that given by Example 2.3, namely (5.1). The setup and
truth sections are similar to the previous programs, except that G, for example, need not be
computed here. The auxiliary objective function I in this case is $I(\cdot; y)$ from equation
(2.21), given by

\[
I(\cdot; y) = J(\cdot) + \Phi(\cdot; y), \tag{5.9}
\]

where

\[
J(u) := \tfrac12\bigl|C_0^{-1/2}(u_0 - m_0)\bigr|^2
 + \sum_{j=0}^{J-1}\tfrac12\bigl|\Sigma^{-1/2}\bigl(u_{j+1} - \Psi(u_j)\bigr)\bigr|^2 \tag{5.10}
\]

and

\[
\Phi(u; y) = \sum_{j=0}^{J-1}\tfrac12\bigl|\Gamma^{-1/2}\bigl(y_{j+1} - h(u_{j+1})\bigr)\bigr|^2. \tag{5.11}
\]

It is defined in lines 38-45. The auxiliary objective function takes as inputs
(u, y, sigma, gamma, alpha, m0, C0, J), and gives output out $= I(u; y)$ where
$u \in \mathbb{R}^{J+1}$ (given all the other parameters in its definition; the issue of
identifying the input to be optimized over is also discussed below).
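
The objective (5.9)-(5.11) can also be evaluated without a loop. The following is a minimal
sketch, assuming the scalar sin map $\Psi(u) = \alpha\sin(u)$ and $h(u) = u$ used by p7.m;
the path u and data y are made-up values, and the names Jterm, Ivec and Phil are
illustrative.

```matlab
% Sketch: evaluate I(u;y) = J(u) + Phi(u;y) of (5.9)-(5.11) without a loop.
% Assumptions: scalar sin map Psi(u)=alpha*sin(u), h(u)=u, and made-up u, y.
J = 5; alpha = 2.5; sigma = 1; gamma = 1; C0 = 1; m0 = 0;
u = linspace(-1, 1, J+1);                 % hypothetical path u_0,...,u_J
y = 0.1*(1:J);                            % hypothetical data y_1,...,y_J
Jterm = (u(1)-m0)^2/(2*C0) ...            % background penalization
      + sum((u(2:end) - alpha*sin(u(1:end-1))).^2)/(2*sigma^2);  % model misfit
Phi = sum((y - u(2:end)).^2)/(2*gamma^2); % data misfit
Ivec = Jterm + Phi;
% Same quantity computed with an explicit loop over j
Phil = 0; JJ = 1/2/C0*(u(1)-m0)^2;
for j = 1:J
    JJ = JJ + 1/2/sigma^2*(u(j+1)-alpha*sin(u(j)))^2;
    Phil = Phil + 1/2/gamma^2*(y(j)-u(j+1))^2;
end
disp(abs(Ivec - (Phil + JJ)))             % agreement up to rounding error
```

Note the index shift: u(1) stores $u_0$, so u(j+1) is paired with y(j), which stores $y_j$.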

The initial guess for the optimization algorithm uu is taken as a standard normal random
vector over $\mathbb{R}^{J+1}$ in line 27. In line 24, a standard normal random matrix of
size $100^2$ is drawn and thrown away. This is so one can easily change the input, e.g. to
randn(z) for $z \in \mathbb{N}$, and induce different random initial vectors uu for the
optimization algorithm, while keeping the data fixed by the random number seed sd set in
line 12. The truth vt may be used as initial guess by uncommenting line 28. In particular, if
the output of the minimization procedure is different for different initial conditions, then
it is possible that the objective function $I(\cdot; y)$ has multiple minima, and hence the
posterior distribution $\mathbb{P}(\cdot|y)$ is multi-modal. As we have already seen in
Figure [3781 this is certainly true even in the case of scalar deterministic dynamics, when
the underlying map gives rise to a chaotic flow.

The MATLAB optimization function fminsearch is called in line 32. The function handle
command @(u)I(u,...) is used to tell fminsearch that the objective function I is to be
considered a function of u, even though it may take other parameter values as well (in this
case, y, sigma, gamma, alpha, m0, C0, and J). The outputs of fminsearch are the value
vmap such that I(vmap) is minimum, the value fval = I(vmap), and the exit flag, which
takes the value 1 if the algorithm has converged. The reader is encouraged to use the
help command for more details on this and other MATLAB functions used in the notes. The
results of this minimization procedure are plotted in lines 34-35 together with the true value
as well as the data y. In Figure [371] such results are presented, including two minima which
were found with different initial conditions.
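
The sensitivity to the initial guess is easy to reproduce on a toy problem. The following
is a minimal sketch, assuming a hypothetical scalar double-well objective f in place of the
function I of p7.m; the extra parameter a is fixed through the function handle in the same
way that y, sigma and the other parameters are fixed in line 32.

```matlab
% Sketch: fminsearch started from different initial guesses can return
% different local minimizers when the objective has several minima.
% Assumption: a hypothetical double-well objective f, not the function I of p7.m.
f = @(u, a) (u^2 - a)^2;                  % minima at u = +sqrt(a) and u = -sqrt(a)
a = 4;                                    % extra parameter, fixed via the handle
[uL, fL, flagL] = fminsearch(@(u) f(u, a), -3);  % start left of the origin
[uR, fR, flagR] = fminsearch(@(u) f(u, a),  3);  % start right of the origin
fprintf('left: u=%.4f (flag %d), right: u=%.4f (flag %d)\n', uL, flagL, uR, flagR);
```

Both runs report convergence, yet they return different minimizers; a converged exit flag
says nothing about whether the minimum found is the global one.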
1 function this=p7
2 clear;set(0,'defaultaxesfontsize',20);format long
3 %%% p7.m weak 4DVAR for sin map (Ex. 1.3)
4 %% setup
5
6 J=5;% number of steps
7 alpha=2.5;% dynamics determined by alpha
8 gamma=1e0;% observational noise variance is gamma^2
9 sigma=1;% dynamics noise variance is sigma^2
10 C0=1;% prior initial condition variance
11 m0=0;% prior initial condition mean
12 sd=1;rng(sd);% choose random number seed
13
14 %% truth
15
16 vt(1)=sqrt(C0)*randn;% truth initial condition
17 for j=1:J
18     vt(j+1)=alpha*sin(vt(j))+sigma*randn;% create truth
19     y(j)=vt(j+1)+gamma*randn;% create data
20 end
21
22 %% solution
23
24 randn(100);% try uncommenting or changing the argument for different
25 % initial conditions -- if the result is not the same,
26 % there may be multimodality (e.g. 1 & 100).
27 uu=randn(1,J+1);% initial guess
28 %uu=vt; % truth initial guess option
29
30 % solve with blackbox
31 % exitflag=1 ==> convergence
32 [vmap,fval,exitflag]=fminsearch(@(u)I(u,y,sigma,gamma,alpha,m0,C0,J),uu)
33
34 figure;plot([0:J],vmap,'Linewidth',2);hold;plot([0:J],vt,'r','Linewidth',2)
35 plot([1:J],y,'g','Linewidth',2);hold;xlabel('j');legend('MAP','truth','y')
36
37 %% auxiliary objective function definition
38 function out=I(u,y,sigma,gamma,alpha,m0,C0,J)
39
40 Phi=0;JJ=1/2/C0*(u(1)-m0)^2;
41 for j=1:J
42     JJ=JJ+1/2/sigma^2*(u(j+1)-alpha*sin(u(j)))^2;
43     Phi=Phi+1/2/gamma^2*(y(j)-u(j+1))^2;
44 end
45 out=Phi+JJ;
5.3 Chapter 4 Programs

The programs p8.m-p15.m, used to generate the figures in Chapter 4, are presented in
this section. Various filtering algorithms used to sample the posterior filtering distribution
are given, involving both Gaussian approximation and particle approximation. Since these
algorithms are run for very large times (large J), they will only be divided in two sections:
setup, in which the parameters are defined, and solution, in which both the truth and
observations are generated, and the online assimilation of the current observation into the
filter solution is performed. The generation of truth can be separated into a truth section as
in the previous sections, but two loops of length J would be required, and loops are inefficient
in MATLAB, so the present format is preferred. The programs in this section are all very
similar, and their output is also similar, giving rise to Figures 4.3-4.12. With the exception
of p8.m and p9.m, the forward model is given by Example 2.3 (5.1), and the output is
identical in form, given for p10.m through p15.m in Figures 4.5-4.7 and 4.8-4.10. Figures 4.11
and 4.12 compare the filters from the other figures. p8.m features a two-dimensional linear
forward model, and p9.m features the forward model from Example 2.4 (5.2). At the end of
each program, the outputs are used to plot the mean and the covariance as well as the mean
square error of the filter as functions of the iteration number j.

5.3.1. p8.m

The first filtering program is p8.m, which contains an implementation of the Kalman Filter
applied to Example 2.2:

\[
v_{j+1} = A v_j + \xi_j, \quad \text{with } A = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix},
\]

and observed data given by

\[
y_{j+1} = H v_{j+1} + \eta_{j+1}
\]

with $H = (1, 0)$ and Gaussian noise. Thus only the first component of $v_j$ is observed.

The parameters and initial condition are defined in the setup section, lines 3-19. The
vectors v, m $\in \mathbb{R}^{N \times J}$, y $\in \mathbb{R}^{J}$ and
c $\in \mathbb{R}^{N \times N \times J}$ are preallocated to hold the truth, mean,
observations, and covariance over the J observation times defined in line 5. In particular,
notice that the true initial condition is drawn from $N(m_0, C_0)$ in line 16, where $m_0 = 0$
and $C_0 = I$ are defined in lines 10-11. The initial estimate of the distribution is defined
in lines 17-18 as $N(m_0, C_0)$, where $m_0 \sim N(0, 100I)$ and $C_0 \leftarrow 100\,C_0$, so
that the code may test the ability of the filter to lock onto the true distribution,
asymptotically in j, given a poor initial estimate. That is to say, the values of
$(m_0, C_0)$ are changed such that the initial condition is not drawn from this distribution.

The main solution loop then follows in lines 21-34. The truth v and the data that
are being assimilated y are sequentially generated within the loop, in lines 24-25. The filter
prediction step, in lines 27-28, consists of computing the predictive mean and covariance
$\hat{m}_j$ and $\hat{C}_j$ as defined in (4.4) and (4.5) respectively:

\[
\hat{m}_{j+1} = A m_j, \qquad \hat{C}_{j+1} = A C_j A^{T} + \Sigma.
\]

Notice that indices are not used for the transient variables mhat and chat representing
$\hat{m}_j$ and $\hat{C}_j$ because they will not be saved from one iteration to the next. In
lines 30-33 we implement the analysis formulae for the Kalman filter from Corollary 4.2. In
particular, the innovation between the observation of the predicted mean and the actual
observation, as introduced in Corollary 4.2, is first computed in line 30:

\[
d_j = y_j - H\hat{m}_j. \tag{5.12}
\]

Again d, which represents $d_j$, does not have any index for the same reason as above. Next,
the Kalman gain defined in Corollary 4.2 is computed in line 31:

\[
K_j = \hat{C}_j H^{T}\bigl(H\hat{C}_j H^{T} + \Gamma\bigr)^{-1}. \tag{5.13}
\]

Once again index j is not used for the transient variable K representing $K_j$. Notice the
"forward slash" / is used to compute B/A $= BA^{-1}$. This is an internal function of MATLAB
which will analyze the matrices B and A to determine an "optimal" method for inversion, given
their structure. The update given in Corollary 4.2 is completed in lines 32-33 with the
equations

\[
m_j = \hat{m}_j + K_j d_j \quad \text{and} \quad C_j = (I - K_j H)\hat{C}_j. \tag{5.14}
\]

Finally, in lines 36-50 the outputs of the program are used to plot the mean and the
covariance as well as the mean square error of the filter as functions of the iteration number
j, as shown in Figure 4.3.
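
A single analysis step can be checked by hand on small numbers. The following is a minimal
sketch of (5.12)-(5.14), assuming a hypothetical predicted mean, predicted covariance and
observation value; it also compares the forward slash with an explicit inverse (Kinv is a
name introduced here).

```matlab
% Sketch: one Kalman analysis step, comparing B/A with B*inv(A).
% Assumption: hypothetical predicted mean/covariance and observation value.
H = [1, 0]; gamma = 1; I = eye(2);
mhat = [1; -1]; chat = [2, 0.5; 0.5, 1];  % predicted mean and covariance
y = 1.5;                                  % a single scalar observation
d = y - H*mhat;                           % innovation (5.12)
K = (chat*H')/(H*chat*H' + gamma^2);      % Kalman gain (5.13) via forward slash
Kinv = (chat*H')*inv(H*chat*H' + gamma^2);% same gain with an explicit inverse
m = mhat + K*d;                           % mean update (5.14)
c = (I - K*H)*chat;                       % covariance update (5.14)
disp(K'), disp(m'), disp(c)
```

Here $H\hat{C}H^{T} + \Gamma = 3$, so $K = (2/3, 1/6)^{T}$ and $d = 0.5$; the unobserved
second component is still corrected, through the off-diagonal entry of the predicted
covariance.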
1 clear;set(0,'defaultaxesfontsize',20);format long
2 %%% p8.m Kalman Filter, Ex. 1.2
3 %% setup
4
5 J=1e3;% number of steps
6 N=2;% dimension of state
7 I=eye(N);% identity operator
8 gamma=1;% observational noise variance is gamma^2*I
9 sigma=1;% dynamics noise variance is sigma^2*I
10 C0=eye(2);% prior initial condition variance
11 m0=[0;0];% prior initial condition mean
12 sd=10;rng(sd);% choose random number seed
13 A=[0 1;-1 0];% dynamics determined by A
14
15 m=zeros(N,J);v=m;y=zeros(J,1);c=zeros(N,N,J);% pre-allocate
16 v(:,1)=m0+sqrtm(C0)*randn(N,1);% initial truth
17 m(:,1)=10*randn(N,1);% initial mean/estimate
18 c(:,:,1)=100*C0;% initial covariance
19 H=[1,0];% observation operator
20
21 %% solution % assimilate!
22
23 for j=1:J
24     v(:,j+1)=A*v(:,j) + sigma*randn(N,1);% truth
25     y(j)=H*v(:,j+1)+gamma*randn;% observation
26
27     mhat=A*m(:,j);% estimator predict
28     chat=A*c(:,:,j)*A'+sigma^2*I;% covariance predict
29
30     d=y(j)-H*mhat;% innovation
31     K=(chat*H')/(H*chat*H'+gamma^2);% Kalman gain
32     m(:,j+1)=mhat+K*d;% estimator update
33     c(:,:,j+1)=(I-K*H)*chat;% covariance update
34 end
35
36 figure;js=21;plot([0:js-1],v(2,1:js));hold;plot([0:js-1],m(2,1:js),'m');
37 plot([0:js-1],m(2,1:js)+reshape(sqrt(c(2,2,1:js)),1,js),'r--')
38 plot([0:js-1],m(2,1:js)-reshape(sqrt(c(2,2,1:js)),1,js),'r--')
39 hold;grid;xlabel('iteration, j');
40 title('Kalman Filter, Ex. 1.2');
41
42 figure;plot([0:J],reshape(c(1,1,:)+c(2,2,:),J+1,1));hold
43 plot([0:J],cumsum(reshape(c(1,1,:)+c(2,2,:),J+1,1))./[1:J+1]','m', ...
44 'Linewidth',2); grid; hold;xlabel('iteration, j');axis([1 1000 0 50]);
45 title('Kalman Filter Covariance, Ex. 1.2');
46
47 figure;plot([0:J],sum((v-m).^2));hold;
48 plot([0:J],cumsum(sum((v-m).^2))./[1:J+1],'m','Linewidth',2);grid
49 hold;xlabel('iteration, j');axis([1 1000 0 50]);
50 title('Kalman Filter Error, Ex. 1.2')
5.3.2. p9.m

The program p9.m contains an implementation of the 3DVAR method applied to the chaotic
logistic map of Example 2.4 (5.2) for r = 4. As in the previous section, the parameters and
initial condition are defined in the setup section, lines 3-16. In particular, notice that the
truth initial condition v(1) and initial mean m(1) are now initialized in lines 12-13 with
a uniform random number using the command rand, so that they are in the interval [0, 1]
where the model is well-defined. Indeed the solution will eventually become unbounded if
initial conditions are chosen outside this interval. With this in mind, we set the dynamics
noise sigma = 0 in line 8, i.e. deterministic dynamics, so that the true dynamics themselves
do not go unbounded.

The analysis step of 3DVAR consists of minimizing

\[
\tfrac12\bigl|\Gamma^{-1/2}(y_{j+1} - Hv)\bigr|^2
 + \tfrac12\bigl|\hat{C}^{-1/2}\bigl(v - \Psi(m_j)\bigr)\bigr|^2.
\]

In this one-dimensional case we set $\Gamma = \gamma^2$, $\hat{C} = \sigma^2$ and define
$\eta = \gamma^2/\sigma^2$, consistent with the assignment C=gamma^2/eta in line 15. The
stabilization parameter $\eta$ (eta) from Example 4.12 is set in line 14, representing the
ratio in uncertainty in the data to that of the model; equivalently it measures trust in the
model over the observations. The choice $\eta = 0$ means the model is irrelevant in the
minimization step (4.12) of 3DVAR, in the observed space; this is the synchronization filter.
Since, in the example, the signal space and observation space both have dimension equal to
one, the choice $\eta = 0$ simply corresponds to using only the data. In contrast the choice
$\eta = \infty$ ignores the observations and uses only the model.
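
The effect of eta on the constant gain can be tabulated directly. The following is a minimal
sketch, assuming H = 1 and the scalar covariance C = gamma^2/eta; the list of eta values is
illustrative.

```matlab
% Sketch: the constant 3DVAR gain in the scalar case as a function of eta.
% Assumption: H = 1 and C = gamma^2/eta; the eta values are illustrative.
gamma = 1e-1; H = 1;
etas = [1e-6, 2e-1, 1, 1e6];
K = zeros(size(etas));
for i = 1:numel(etas)
    C = gamma^2/etas(i);                  % constant model covariance
    K(i) = (C*H')/(H*C*H' + gamma^2);     % gain, equal to 1/(1+eta) here
end
disp([etas; K; 1./(1+etas)])              % rows: eta, computed gain, 1/(1+eta)
```

In this scalar setting $K = 1/(1+\eta)$, independently of gamma: small eta gives a gain near
one (the update follows the data), while large eta gives a gain near zero (the update
follows the model).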

The 3DVAR set-up gives rise to the constant scalar covariance C and resultant constant
scalar gain K; this should not be confused with the changing $K_j$ in (5.13), temporarily
defined by K in line 31 of p8.m. The main solution loop follows in lines 20-33. Up to the
different forward model, lines 21-22, 24, 26, and 27 of this program are identical to lines
24-25, 27, 30, and 32 of p8.m described in section 5.3.1. The only other difference is that
the covariance updates are absent here, because of the constant covariance assumption
underlying the 3DVAR algorithm.

The 3DVAR filter may in principle generate an estimated mean mhat outside [0,1], because
of the noise in the data. In order to flag potential unbounded trajectories of the filter,
which in principle could arise because of this, an extra stopping criterion is included in
lines 29-32. To illustrate this, try setting sigma $\neq 0$ in line 8. Then the signal will
eventually become unbounded, regardless of how small the noise variance is chosen. In this
case the estimate will surely blow up while tracking the unbounded signal. Otherwise, if
$\eta$ is chosen appropriately so as to stabilize the filter, it is extremely unlikely that
the estimate will ever blow up. Finally, similarly to p8.m, in the last lines of the program
we use the outputs of the program in order to produce Figure 4.4, namely plotting the mean
and the covariance as well as the mean square error of the filter as functions of the
iteration number j.
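
The blow-up mechanism is a property of the logistic map itself and can be seen without any
filtering. The following is a minimal sketch, assuming two illustrative initial conditions,
one inside and one just outside [0,1].

```matlab
% Sketch: the logistic map with r = 4 keeps [0,1] invariant, but iterates
% started outside [0,1] become large and negative very quickly.
% Assumption: the two initial conditions are illustrative values.
r = 4; J = 8;
vin = 0.3; vout = 1.01;                   % inside and just outside [0,1]
for j = 1:J
    vin  = r*vin*(1-vin);                 % stays in [0,1]
    vout = r*vout*(1-vout);               % negative after one step, then grows
end
fprintf('inside: %.4f, outside: %.4e\n', vin, vout);
```

Once an iterate is negative, its magnitude more than quadruples at every step, which is why
any noise that pushes the signal outside [0,1] eventually triggers the stopping criterion.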
1 clear;set(0,'defaultaxesfontsize',20);format long
2 %%% p9.m 3DVAR Filter, deterministic logistic map (Ex. 1.4)
3
4 %% setup
5 J=1e3;% number of steps
6 r=4;% dynamics determined by r
7 gamma=1e-1;% observational noise variance is gamma^2
8 sigma=0;% dynamics noise variance is sigma^2
9 sd=10;rng(sd);% choose random number seed
10
11 m=zeros(J,1);v=m;y=m;% pre-allocate
12 v(1)=rand;% initial truth, in [0,1]
13 m(1)=rand;% initial mean/estimate, in [0,1]
14 eta=2e-1;% stabilization coefficient 0 < eta << 1
15 C=gamma^2/eta;H=1;% covariance and observation operator
16 K=(C*H')/(H*C*H'+gamma^2);% Kalman gain
17
18 %% solution % assimilate!
19
20 for j=1:J
21   v(j+1)=r*v(j)*(1-v(j)) + sigma*randn;% truth
22   y(j)=H*v(j+1)+gamma*randn;% observation
23
24   mhat=r*m(j)*(1-m(j));% estimator predict
25
26   d=y(j)-H*mhat;% innovation
27   m(j+1)=mhat+K*d;% estimator update
28
29   if norm(mhat)>1e5
30     disp('blowup!')
31     break
32   end
33 end
34 js=21;% plot truth, mean, standard deviation, observations
35 figure;plot([0:js-1],v(1:js));hold;plot([0:js-1],m(1:js),'m');
36 plot([0:js-1],m(1:js)+sqrt(C),'r--');plot([1:js-1],y(1:js-1),'kx');
37 plot([0:js-1],m(1:js)-sqrt(C),'r--');hold;grid;xlabel('iteration, j');
38 title('3DVAR Filter, Ex. 1.4')
39
40 figure;plot([0:J],C*[0:J].^0);hold
41 plot([0:J],C*[0:J].^0,'m','Linewidth',2);grid
42 hold;xlabel('iteration, j');title('3DVAR Filter Covariance, Ex. 1.4');
43
44 figure;plot([0:J],(v-m).^2);hold;
45 plot([0:J],cumsum((v-m).^2)./[1:J+1]','m','Linewidth',2);grid
46 hold;xlabel('iteration, j');
47 title('3DVAR Filter Error, Ex. 1.4')
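
As a companion to the 3DVAR program p9.m, the following is a minimal Python sketch of the same scalar recursion. It is not one of the book's programs: the function name three_dvar_logistic, its arguments and its return values are our own illustrative assumptions, only the Python standard library is used, and plotting is omitted. One point it makes easy to check is that, with C = gamma^2/eta and H = 1, the gain reduces to K = 1/(1+eta).

```python
import random


def three_dvar_logistic(J=1000, r=4.0, gamma=0.1, eta=0.2, seed=10):
    """Scalar 3DVAR for the deterministic logistic map (sketch of p9.m).

    Returns the truth v, the estimates m, the observations y and the gain K.
    All names here are illustrative, not taken from the book.
    """
    rng = random.Random(seed)
    C = gamma**2 / eta                   # fixed model covariance, as in p9.m
    H = 1.0                              # observation operator
    K = C * H / (H * C * H + gamma**2)   # constant gain, equals 1/(1+eta)

    v = [rng.random()]                   # initial truth in [0,1)
    m = [rng.random()]                   # initial estimate in [0,1)
    y = []
    for j in range(J):
        v.append(r * v[j] * (1.0 - v[j]))                     # truth (no model noise)
        y.append(H * v[j + 1] + gamma * rng.gauss(0.0, 1.0))  # observation
        mhat = r * m[j] * (1.0 - m[j])                        # estimator predict
        d = y[j] - H * mhat                                   # innovation
        m.append(mhat + K * d)                                # estimator update
        if abs(mhat) > 1e5:                                   # same safeguard as p9.m
            print("blowup!")
            break
    return v, m, y, K


if __name__ == "__main__":
    v, m, y, K = three_dvar_logistic()
    print("gain K =", K)
```

Since K depends only on eta in this setting, decreasing eta moves K towards 1, so the estimate follows the observations more closely.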
5.3.3. p10.m 

A variation of program p9.m is given by p10.m, where the 3DVAR filter is implemented for 
Example 2.3 given by (5.1). Indeed the remaining programs of this section will all be for the 
same Example 2.3, so this will not be mentioned again. In this case, the initial condition is 
again taken as a draw from the prior $N(m_0, C_0)$ as in p7.m, and the initial mean estimate is 
again changed to $m_0 \sim N(0, 100I)$ so that the code may test the ability of the filter to lock 
onto the signal given a poor initial estimate. Furthermore, for this problem there is no need to 
introduce the stopping criterion present in the case of p9.m since the underlying deterministic 
dynamics are dissipative. The output of this program is shown in Figure 4.5. 
1 clear;set(0,'defaultaxesfontsize',20);format long
2 %%% p10.m 3DVAR Filter, sin map (Ex. 1.3)
3
4 %% setup
5 J=1e3;% number of steps
6 alpha=2.5;% dynamics determined by alpha
7 gamma=1;% observational noise variance is gamma^2
8 sigma=3e-1;% dynamics noise variance is sigma^2
9 C0=9e-2;% prior initial condition variance
10 m0=0;% prior initial condition mean
11 sd=1;rng(sd);% choose random number seed
12
13 m=zeros(J,1);v=m;y=m;% pre-allocate
14 v(1)=m0+sqrt(C0)*randn;% initial truth
15 m(1)=10*randn;% initial mean/estimate
16 eta=2e-1;% stabilization coefficient 0 < eta << 1
17 c=gamma^2/eta;H=1;% covariance and observation operator
18 K=(c*H')/(H*c*H'+gamma^2);% Kalman gain
19
20 %% solution % assimilate!
21
22 for j=1:J
23   v(j+1)=alpha*sin(v(j)) + sigma*randn;% truth
24   y(j)=H*v(j+1)+gamma*randn;% observation
25
26   mhat=alpha*sin(m(j));% estimator predict
27
28   d=y(j)-H*mhat;% innovation
29   m(j+1)=mhat+K*d;% estimator update
30
31 end
32
33 js=21;% plot truth, mean, standard deviation, observations
34 figure;plot([0:js-1],v(1:js));hold;plot([0:js-1],m(1:js),'m');
35 plot([0:js-1],m(1:js)+sqrt(c),'r--');plot([1:js-1],y(1:js-1),'kx');
36 plot([0:js-1],m(1:js)-sqrt(c),'r--');hold;grid;xlabel('iteration, j');
37 title('3DVAR Filter, Ex. 1.3')
38
39 figure;plot([0:J],c*[0:J].^0);hold
40 plot([0:J],c*[0:J].^0,'m','Linewidth',2);grid
41 hold;xlabel('iteration, j');
42 title('3DVAR Filter Covariance, Ex. 1.3');
43
44 figure;plot([0:J],(v-m).^2);hold;
45 plot([0:J],cumsum((v-m).^2)./[1:J+1]','m','Linewidth',2);grid
46 hold;xlabel('iteration, j');
47 title('3DVAR Filter Error, Ex. 1.3')
5.3.4. p11.m 

The next program is p11.m. This program comprises an implementation of the extended 
Kalman Filter. It is very similar in structure to p8.m, except with a different forward model. 
Since the dynamics are scalar, the observation operator is defined by setting H to take value 1 
in line 16. The predicting covariance $\hat{C}_j$ is not independent of the mean as it is for the linear 
problem p8.m. Instead, as described in section 4.2.2, it is determined via the linearization of 
the forward map around $m_j$, in line 26, where the model noise variance $\sigma^2$ is also added: 

$$\hat{C}_{j+1} = (\alpha \cos(m_j))\, C_j\, (\alpha \cos(m_j)) + \sigma^2.$$

As in p8.m we change the prior to a poor initial estimate of the distribution to study if, 
and how, the filter locks onto a neighbourhood of the true signal, despite poor initialization, 
for large j. This initialization is in lines 15-16, where $m_0 \sim N(0, 100I)$ and $C_0 \leftarrow 10 C_0$. 
Subsequent filtering programs use an identical initialization, with the same rationale as in 
this case. We will not state this again. The output of this program is shown in Figure 4.6. 
1 clear;set(0,'defaultaxesfontsize',20);format long
2 %%% p11.m Extended Kalman Filter, sin map (Ex. 1.3)
3
4 %% setup
5 J=1e3;% number of steps
6 alpha=2.5;% dynamics determined by alpha
7 gamma=1;% observational noise variance is gamma^2
8 sigma=3e-1;% dynamics noise variance is sigma^2
9 C0=9e-2;% prior initial condition variance
10 m0=0;% prior initial condition mean
11 sd=1;rng(sd);% choose random number seed
12
13 m=zeros(J,1);v=m;y=m;c=m;% pre-allocate
14 v(1)=m0+sqrt(C0)*randn;% initial truth
15 m(1)=10*randn;% initial mean/estimate
16 c(1)=10*C0;H=1;% initial covariance and observation operator
17
18 %% solution % assimilate!
19
20 for j=1:J
21
22   v(j+1)=alpha*sin(v(j)) + sigma*randn;% truth
23   y(j)=H*v(j+1)+gamma*randn;% observation
24
25   mhat=alpha*sin(m(j));% estimator predict
26   chat=alpha*cos(m(j))*c(j)*alpha*cos(m(j))+sigma^2;% covariance predict
27
28   d=y(j)-H*mhat;% innovation
29   K=(chat*H')/(H*chat*H'+gamma^2);% Kalman gain
30   m(j+1)=mhat+K*d;% estimator update
31   c(j+1)=(1-K*H)*chat;% covariance update
32
33 end
34
35 js=21;% plot truth, mean, standard deviation, observations
36 figure;plot([0:js-1],v(1:js));hold;plot([0:js-1],m(1:js),'m');
37 plot([0:js-1],m(1:js)+sqrt(c(1:js)),'r--');plot([1:js-1],y(1:js-1),'kx');
38 plot([0:js-1],m(1:js)-sqrt(c(1:js)),'r--');hold;grid;
39 xlabel('iteration, j');title('ExKF, Ex. 1.3')
40
41 figure;plot([0:J],c);hold
42 plot([0:J],cumsum(c)./[1:J+1]','m','Linewidth',2);grid
43 hold;xlabel('iteration, j');
44 title('ExKF Covariance, Ex. 1.3');
45
46 figure;plot([0:J],(v-m).^2);hold;
47 plot([0:J],cumsum((v-m).^2)./[1:J+1]','m','Linewidth',2);grid
48 hold;xlabel('iteration, j');
49 title('ExKF Error, Ex. 1.3')
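
To isolate the predict and update formulae used inside the loop of p11.m, here is a small Python sketch of a single ExKF cycle for the scalar sin map. It is an illustration only, not the book's code: the function name exkf_step and its argument order are assumptions made for this example, and the default parameter values simply copy those in the setup section of p11.m.

```python
import math


def exkf_step(m, c, y, alpha=2.5, sigma=0.3, gamma=1.0, H=1.0):
    """One predict/update cycle of the scalar ExKF for v -> alpha*sin(v).

    m, c are the current mean and variance; y is the new observation.
    Returns the updated mean, updated variance and the Kalman gain.
    """
    mhat = alpha * math.sin(m)                 # mean predict
    jac = alpha * math.cos(m)                  # derivative of the map at m
    chat = jac * c * jac + sigma**2            # covariance predict (linearized)
    d = y - H * mhat                           # innovation
    K = chat * H / (H * chat * H + gamma**2)   # Kalman gain
    m_new = mhat + K * d                       # mean update
    c_new = (1.0 - K * H) * chat               # covariance update
    return m_new, c_new, K


if __name__ == "__main__":
    print(exkf_step(0.0, 1.0, 0.5))
```

Because the derivative alpha*cos(m) is evaluated at the current mean, the predicted variance changes with the estimate, which is the dependence on the mean noted in the description of the program.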
5.3.5. p12.m 

The program p12.m contains an implementation of the ensemble Kalman Filter, with perturbed 
observations, as described in section 4.2.3. The structure of this program is again very 
similar to p8.m and p11.m, except now an ensemble of particles, of size N defined in line 
12, is retained as an approximation of the filtering distribution. The ensemble $\{v_0^{(n)}\}_{n=1}^N$ 
represented by the matrix U is then constructed out of draws from the initial Gaussian $N(m_0, C_0)$ 
(set in lines 16-17) in line 18, and the mean $m_0$ is reset to the ensemble sample mean. 

In line 27 the predicting ensemble $\{\hat{v}_j^{(n)}\}_{n=1}^N$ represented by the matrix Uhat is computed 
from a realization of the forward map applied to each ensemble member. This is then used 
to compute the ensemble sample mean $\hat{m}_j$ (mhat) and covariance $\hat{C}_j$ (chat). There is now 
an ensemble of "innovations" with a new i.i.d. realization $y_j^{(n)} \sim N(y_j, \Gamma)$ for each ensemble 
member, computed in line 31 (not to be confused with the actual innovation as defined in 
Equation (5.12)): 

$$d_j^{(n)} = y_j^{(n)} - H \hat{v}_j^{(n)}.$$

The Kalman gain $K_j$ (K) is computed using (5.13), very similarly to p8.m and p11.m, and 
the ensemble of updates is computed in line 33: 

$$v_j^{(n)} = \hat{v}_j^{(n)} + K_j d_j^{(n)}.$$

The output of this program is shown in Figure 4.7. Furthermore, long simulations of length 
$J = 10^5$ were performed for this and the previous two programs p10.m and p11.m and their 
errors are compared in Figure 4.11. 
1 clear;set(0,'defaultaxesfontsize',20);format long
2 %%% p12.m Ensemble Kalman Filter (PO), sin map (Ex. 1.3)
3 %% setup
4
5 J=1e5;% number of steps
6 alpha=2.5;% dynamics determined by alpha
7 gamma=1;% observational noise variance is gamma^2
8 sigma=3e-1;% dynamics noise variance is sigma^2
9 C0=9e-2;% prior initial condition variance
10 m0=0;% prior initial condition mean
11 sd=1;rng(sd);% choose random number seed
12 N=10;% number of ensemble members
13
14 m=zeros(J,1);v=m;y=m;c=m;U=zeros(J,N);% pre-allocate
15 v(1)=m0+sqrt(C0)*randn;% initial truth
16 m(1)=10*randn;% initial mean/estimate
17 c(1)=10*C0;H=1;% initial covariance and observation operator
18 U(1,:)=m(1)+sqrt(c(1))*randn(1,N);m(1)=sum(U(1,:))/N;% initial ensemble
19
20 %% solution % assimilate!
21
22 for j=1:J
23
24   v(j+1)=alpha*sin(v(j)) + sigma*randn;% truth
25   y(j)=H*v(j+1)+gamma*randn;% observation
26
27   Uhat=alpha*sin(U(j,:))+sigma*randn(1,N);% ensemble predict
28   mhat=sum(Uhat)/N;% estimator predict
29   chat=(Uhat-mhat)*(Uhat-mhat)'/(N-1);% covariance predict
30
31   d=y(j)+gamma*randn(1,N)-H*Uhat;% innovation
32   K=(chat*H')/(H*chat*H'+gamma^2);% Kalman gain
33   U(j+1,:)=Uhat+K*d;% ensemble update
34   m(j+1)=sum(U(j+1,:))/N;% estimator update
35   c(j+1)=(U(j+1,:)-m(j+1))*(U(j+1,:)-m(j+1))'/(N-1);% covariance update
36
37 end
38
39 js=21;% plot truth, mean, standard deviation, observations
40 figure;plot([0:js-1],v(1:js));hold;plot([0:js-1],m(1:js),'m');
41 plot([0:js-1],m(1:js)+sqrt(c(1:js)),'r--');plot([1:js-1],y(1:js-1),'kx');
42 plot([0:js-1],m(1:js)-sqrt(c(1:js)),'r--');hold;grid;
43 xlabel('iteration, j');title('EnKF, Ex. 1.3')
44
45 figure;plot([0:J],c);hold
46 plot([0:J],cumsum(c)./[1:J+1]','m','Linewidth',2);grid
47 hold;xlabel('iteration, j');
48 title('EnKF Covariance, Ex. 1.3');
49
50 figure;plot([0:J],(v-m).^2);hold;
51 plot([0:J],cumsum((v-m).^2)./[1:J+1]','m','Linewidth',2);grid
52 hold;xlabel('iteration, j');
53 title('EnKF Error, Ex. 1.3')
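
The perturbed-observation update in the loop of p12.m can be sketched for a single assimilation cycle as follows. This Python fragment is a hedged illustration rather than the book's implementation: the function name enkf_po_step, the list-based ensemble and the rng argument are assumptions introduced for this example, and only the standard library is used.

```python
import math
import random


def enkf_po_step(U, y, alpha=2.5, sigma=0.3, gamma=1.0, H=1.0, rng=random):
    """One cycle of the scalar perturbed-observation EnKF (sketch of p12.m).

    U is the current ensemble (a list of floats), y the new observation.
    Returns the updated ensemble with its sample mean and sample variance.
    """
    N = len(U)
    # predict each member with an independent model-noise draw
    Uhat = [alpha * math.sin(u) + sigma * rng.gauss(0.0, 1.0) for u in U]
    mhat = sum(Uhat) / N
    chat = sum((u - mhat) ** 2 for u in Uhat) / (N - 1)   # sample covariance
    K = chat * H / (H * chat * H + gamma**2)              # Kalman gain
    # each member is compared with its own perturbed copy of the observation
    d = [y + gamma * rng.gauss(0.0, 1.0) - H * u for u in Uhat]
    U_new = [u + K * di for u, di in zip(Uhat, d)]        # ensemble update
    m_new = sum(U_new) / N
    c_new = sum((u - m_new) ** 2 for u in U_new) / (N - 1)
    return U_new, m_new, c_new


if __name__ == "__main__":
    rng = random.Random(1)
    U0 = [10.0 * rng.gauss(0.0, 1.0) for _ in range(10)]
    print(enkf_po_step(U0, 0.3, rng=rng)[1:])
```

Passing a seeded random.Random instance as rng makes a run reproducible, in the same spirit as the rng(sd) call in the MATLAB programs.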
5.3.6. p13.m 

The program p13.m contains a particular square-root filter implementation of the ensemble 
Kalman filter, namely the ETKF filter described in detail in section 4.2.4. The program thus 
is very similar to p12.m for the EnKF with perturbed observations. In particular, the filtering 
distribution of the state is again approximated by an ensemble of particles. The predicting 
ensemble $\{\hat{v}_j^{(n)}\}_{n=1}^N$ (Uhat), mean $\hat{m}_j$ (mhat), and covariance $\hat{C}_j$ (chat) are computed exactly 
as in p12.m. However, this time the covariance is kept in factorized form $\hat{X}_j \hat{X}_j^T = \hat{C}_j$ in 
lines 29-30, with factors denoted Xhat. The transformation matrix is computed in line 31 

$$T_j = \left(I_N + \hat{X}_j^T H^T \Gamma^{-1} H \hat{X}_j\right)^{-1/2},$$

and $X_j = \hat{X}_j T_j$ (X) is computed in line 32, from which the covariance $C_j = X_j X_j^T$ is 
reconstructed in line 38. A single innovation $d_j$ is computed in line 34 and a single updated 
mean $m_j$ is then computed in line 36 using the Kalman gain $K_j$ (5.13) computed in line 
35. This is the same as in the Kalman Filter and extended Kalman filter (ExKF) of p8.m 
and p11.m, in contrast to the EnKF with perturbed observations appearing in p12.m. The 
ensemble is then updated to U in line 37 using the formula 

$$v_j^{(n)} = m_j + X_j^{(n)} \sqrt{N-1},$$

where $X_j^{(n)}$ is the $n$th column of $X_j$. 

Notice that the operator which is factorized and inverted has dimension N, which in this 
case is large in comparison to the state and observation dimensions. This is of course natural 
for computing sample statistics but in the context of the one dimensional examples considered 
here makes p13.m run far more slowly than p12.m. However in many applications the signal 
state-space dimension is the largest, with the observation dimension coming next, and the 
ensemble size being far smaller than either of these. In this context the ETKF has become a 
very popular method. So its relative inefficiency, compared for example with the perturbed 
observations Kalman filter, should not be given too much weight in the overall evaluation of 
the method. Results illustrating the algorithm are shown in Figure 4.8. 
clear;set(0,'defaultaxesfontsize',20);format long
%%% p13.m Ensemble Kalman Filter (ETKF), sin map (Ex. 1.3)
%% setup

J=1e3;% number of steps
alpha=2.5;% dynamics determined by alpha
gamma=1;% observational noise variance is gamma^2
sigma=3e-1;% dynamics noise variance is sigma^2
C0=9e-2;% prior initial condition variance
m0=0;% prior initial condition mean
sd=1;rng(sd);% choose random number seed
N=10;% number of ensemble members

m=zeros(J,1);v=m;y=m;c=m;U=zeros(J,N);% pre-allocate
v(1)=m0+sqrt(C0)*randn;% initial truth
m(1)=10*randn;% initial mean/estimate
c(1)=10*C0;H=1;% initial covariance and observation operator
U(1,:)=m(1)+sqrt(c(1))*randn(1,N);m(1)=sum(U(1,:))/N;% initial ensemble

%% solution % assimilate!

for j=1:J

  v(j+1)=alpha*sin(v(j)) + sigma*randn;% truth
  y(j)=H*v(j+1)+gamma*randn;% observation

  Uhat=alpha*sin(U(j,:))+sigma*randn(1,N);% ensemble predict
  mhat=sum(Uhat)/N;% estimator predict
  Xhat=(Uhat-mhat)/sqrt(N-1);% centered ensemble
  chat=Xhat*Xhat';% covariance predict
  T=sqrtm(inv(eye(N)+Xhat'*H'*H*Xhat/gamma^2));% sqrt transform
  X=Xhat*T;% transformed centered ensemble

  d=y(j)-H*mhat;randn(1,N);% innovation
  K=(chat*H')/(H*chat*H'+gamma^2);% Kalman gain
  m(j+1)=mhat+K*d;% estimator update
  U(j+1,:)=m(j+1)+X*sqrt(N-1);% ensemble update
  c(j+1)=X*X';% covariance update

end

js=21;% plot truth, mean, standard deviation, observations
figure;plot([0:js-1],v(1:js));hold;plot([0:js-1],m(1:js),'m');
plot([0:js-1],m(1:js)+sqrt(c(1:js)),'r--');plot([1:js-1],y(1:js-1),'kx');
plot([0:js-1],m(1:js)-sqrt(c(1:js)),'r--');hold;grid;
xlabel('iteration, j');title('EnKF(ETKF), Ex. 1.3');

figure;plot([0:J],(v-m).^2);hold;
plot([0:J],cumsum((v-m).^2)./[1:J+1]','m','Linewidth',2);grid
plot([0:J],cumsum(c)./[1:J+1]','r--','Linewidth',2);
hold;xlabel('iteration, j');
title('EnKF(ETKF) Error, Ex. 1.3')
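
For a scalar state with H = 1, as in p13.m, the matrix I + Xhat'*Xhat/gamma^2 is a rank-one perturbation of the identity, so the product Xhat*T can be written in closed form: Xhat is a left eigenvector of T, and Xhat*T = Xhat/sqrt(1 + chat/gamma^2). The Python sketch below uses this observation instead of a matrix square root. That substitution is our own simplification, valid only under the stated scalar assumption, and the function name etkf_scalar_step is hypothetical. It also lets one check that the transformed ensemble reproduces the Kalman covariance update (1 - K)*chat.

```python
import math


def etkf_scalar_step(Uhat, y, gamma=1.0):
    """Analysis step of the ETKF for a scalar state with H = 1 (sketch).

    Uhat is the predicted ensemble (list of floats), y the observation.
    Returns the updated ensemble, mean, variance, the gain and chat.
    """
    N = len(Uhat)
    mhat = sum(Uhat) / N
    x = [(u - mhat) / math.sqrt(N - 1) for u in Uhat]   # centred ensemble Xhat
    chat = sum(xi * xi for xi in x)                     # Xhat*Xhat'
    scale = 1.0 / math.sqrt(1.0 + chat / gamma**2)      # action of T on Xhat
    X = [xi * scale for xi in x]                        # X = Xhat*T
    K = chat / (chat + gamma**2)                        # Kalman gain
    m_new = mhat + K * (y - mhat)                       # single mean update
    U_new = [m_new + xi * math.sqrt(N - 1) for xi in X] # ensemble update
    c_new = sum(xi * xi for xi in X)                    # X*X'
    return U_new, m_new, c_new, K, chat


if __name__ == "__main__":
    U_new, m_new, c_new, K, chat = etkf_scalar_step([0.1, -0.4, 0.9, 0.3], 0.5)
    print(c_new, (1.0 - K) * chat)
```

In higher state dimensions no such shortcut is available and the N by N square root in line 31 of p13.m is genuinely required.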
5.3.7. p14.m 

The program p14.m is an implementation of the standard SIRS filter from subsection 4.3.2. 
The setup section is almost identical to the EnKF methods, because those methods also 
rely on particle approximations of the filtering distribution. However, the particle filters 
consistently estimate quite general distributions, whilst the EnKF is only provably accurate 
for Gaussian distributions. The truth and data generation and ensemble prediction in lines 
22-25 are the same as in p12.m and p13.m. The way this prediction in line 25 is phrased 
in section 4.3.2 is $\hat{v}_{j+1}^{(n)} \sim \mathbb{P}(\cdot \mid v_j^{(n)})$. An ensemble of "innovation" terms $\{d_{j+1}^{(n)}\}_{n=1}^N$ is 
again required, but all using the same observation, as computed in line 26. Assuming 
$w_j^{(n)} = 1/N$, then 

$$\hat{w}_{j+1}^{(n)} \propto \mathbb{P}(y_{j+1} \mid \hat{v}_{j+1}^{(n)}) \propto \exp\left(-\tfrac{1}{2}\left|\Gamma^{-1/2} d_{j+1}^{(n)}\right|^2\right),$$

where $d_{j+1}^{(n)}$ is the innovation of the $n$th particle, as given in (4.27). The vector of un-normalized 
weights $\{\hat{w}_{j+1}^{(n)}\}_{n=1}^N$ (what) is computed in line 27 and normalized to $\{w_{j+1}^{(n)}\}_{n=1}^N$ (w) in line 
28. Lines 30-37 implement the resampling step. First, the cumulative distribution function 
of the weights $W \in [0,1]^N$ (ws) is computed in line 30. Notice W has the properties that 
$W_1 = w_{j+1}^{(1)}$, $W_n \le W_{n+1}$, and $W_N = 1$. Then N uniform random numbers $\{u^{(n)}\}_{n=1}^N$ are 
drawn. For each $u^{(n)}$ let $n^*$ be such that $W_{n^*-1} \le u^{(n)} < W_{n^*}$. This $n^*$ (ix) is found in line 
32 using the find function, which can identify the first or last element in an array to exceed 
zero (see help file): ix = find(ws > rand, 1, 'first'). This corresponds to 
drawing the $(n^*)$th element from the discrete measure defined by $\{w_{j+1}^{(n)}\}_{n=1}^N$. The $n$th particle 
$v_{j+1}^{(n)}$ (U(j+1,n)) is set to be equal to $\hat{v}_{j+1}^{(n^*)}$ (Uhat(ix)) in line 35. The sample mean and 
covariance are then computed in lines 39-40. The rest of the program follows the others, 
generating the output displayed in Figure 4.9. 
1 clear;set(0,'defaultaxesfontsize',20);format long
2 %%% p14.m Particle Filter (SIRS), sin map (Ex. 1.3)
3
4 %% setup
5 J=1e3;% number of steps
6 alpha=2.5;% dynamics determined by alpha
7 gamma=1;% observational noise variance is gamma^2
8 sigma=3e-1;% dynamics noise variance is sigma^2
9 C0=9e-2;% prior initial condition variance
10 m0=0;% prior initial condition mean
11 sd=1;rng(sd);% choose random number seed
12 N=100;% number of ensemble members
13
14 m=zeros(J,1);v=m;y=m;c=m;U=zeros(J,N);% pre-allocate
15 v(1)=m0+sqrt(C0)*randn;% initial truth
16 m(1)=10*randn;% initial mean/estimate
17 c(1)=10*C0;H=1;% initial covariance and observation operator
18 U(1,:)=m(1)+sqrt(c(1))*randn(1,N);m(1)=sum(U(1,:))/N;% initial ensemble
19
20 %% solution % Assimilate!
21 for j=1:J
22   v(j+1)=alpha*sin(v(j)) + sigma*randn;% truth
23   y(j)=H*v(j+1)+gamma*randn;% observation
24
25   Uhat=alpha*sin(U(j,:))+sigma*randn(1,N);% ensemble predict
26   d=y(j)-H*Uhat;% ensemble innovation
27   what=exp(-1/2*(1/gamma^2*d.^2));% weight update
28   w=what/sum(what);% normalize predict weights
29
30   ws=cumsum(w);% resample: compute cdf of weights
31   for n=1:N
32     ix=find(ws>rand,1,'first');% resample: draw rand \sim U[0,1] and
33     % find the index of the particle corresponding to the first time
34     % the cdf of the weights exceeds rand.
35     U(j+1,n)=Uhat(ix);% resample: reset the nth particle to the one
36     % with the given index above
37   end
38
39   m(j+1)=sum(U(j+1,:))/N;% estimator update
40   c(j+1)=(U(j+1,:)-m(j+1))*(U(j+1,:)-m(j+1))'/N;% covariance update
41 end
42
43 js=21;% plot truth, mean, standard deviation, observations
44 figure;plot([0:js-1],v(1:js));hold;plot([0:js-1],m(1:js),'m');
45 plot([0:js-1],m(1:js)+sqrt(c(1:js)),'r--');plot([1:js-1],y(1:js-1),'kx');
46 plot([0:js-1],m(1:js)-sqrt(c(1:js)),'r--');hold;grid;
47 xlabel('iteration, j');title('Particle Filter (Standard), Ex. 1.3')
48
49 figure;plot([0:J],(v-m).^2);hold;
50 plot([0:J],cumsum((v-m).^2)./[1:J+1]','m','Linewidth',2);grid
51 hold;xlabel('iteration, j');
52 title('Particle Filter (Standard) Error, Ex. 1.3')
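
The resampling loop of p14.m selects, for each uniform draw, the first index at which the cumulative weight exceeds the draw. A minimal Python sketch of that step is given below. It is an illustration, not the book's code: the names pick_index and resample are hypothetical, and bisect.bisect_right from the standard library is used as an assumed stand-in for MATLAB's find(ws>rand,1,'first').

```python
import bisect
import itertools
import random


def pick_index(ws, u):
    """Index of the first cumulative weight that strictly exceeds u.

    The min() guards against round-off leaving ws[-1] slightly below 1.
    """
    return min(bisect.bisect_right(ws, u), len(ws) - 1)


def resample(particles, weights, rng=random):
    """Multinomial resampling from normalized weights via their cdf."""
    ws = list(itertools.accumulate(weights))   # cumulative weights (cdf)
    out = []
    for _ in range(len(particles)):
        u = rng.random()                       # u ~ U[0,1)
        out.append(particles[pick_index(ws, u)])
    return out


if __name__ == "__main__":
    print(resample([-1.0, 0.0, 2.0], [0.1, 0.2, 0.7], random.Random(0)))
```

Because the cumulative weights are sorted, the binary search costs O(log N) per draw, whereas a linear scan costs O(N); for the ensemble sizes used here the difference is negligible.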
5.3.8. p15.m 

The program p15.m is an implementation of the SIRS(OP) algorithm from subsection 4.3.3. 
The setup section and truth and observation generation are again the same as in the previous 
programs. The difference between this program and p14.m arises because the 
importance sampling proposal kernel $Q_j$ with density $\mathbb{P}(v_{j+1} \mid v_j, y_{j+1})$ is used to propose each 
$\hat{v}_{j+1}^{(n)}$ given each particular $v_j^{(n)}$; in particular $Q_j$ depends on the next data point $y_{j+1}$, whereas the 
kernel P used in p14.m has density $\mathbb{P}(v_{j+1} \mid v_j)$, which is independent of $y_{j+1}$. 

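Observe that if $v_j^{(n)}$ and $y_{j+1}$ are both fixed, then $\mathbb{P}(v_{j+1} \mid v_j^{(n)}, y_{j+1})$ is the density of the 
Gaussian with mean $m'^{(n)}$ and covariance $\Sigma'$ given by 

$$m'^{(n)} = \Sigma'\left(\Sigma^{-1}\Psi(v_j^{(n)}) + H^T\Gamma^{-1}y_{j+1}\right), \qquad (\Sigma')^{-1} = \Sigma^{-1} + H^T\Gamma^{-1}H,$$

where $\Psi(v) = \alpha \sin(v)$ is the forward map. Therefore, $\Sigma'$ (Sig) and the ensemble of means 
$\{m'^{(n)}\}_{n=1}^N$ (vector em) are computed in lines 25 and 26 and used to sample 
$\hat{v}_{j+1}^{(n)} \sim N(m'^{(n)}, \Sigma')$ in line 27 for all of $\{\hat{v}_{j+1}^{(n)}\}_{n=1}^N$ (Uhat). 

Now the weights are therefore updated by (4.34) rather than (4.27), i.e. assuming $w_j^{(n)} = 1/N$, then 

$$\hat{w}_{j+1}^{(n)} \propto \mathbb{P}(y_{j+1} \mid v_j^{(n)}) \propto \exp\left(-\tfrac{1}{2}\left|(\Gamma + H\Sigma H^T)^{-1/2}\left(y_{j+1} - H\Psi(v_j^{(n)})\right)\right|^2\right).$$

This is computed in lines 29-30, using another auxiliary "innovation" vector d in line 29. 
Lines 33-43 are again identical to lines 30-40 of program p14.m, performing the resampling 
step and computing sample mean and covariance. 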

The output of this program was used to produce Figure 4.10, similar to the other filtering 
algorithms. Furthermore, long simulations of length $J = 10^5$ were performed for this and the 
previous three programs p12.m, p13.m and p14.m and their errors are compared in Figure 
4.12, similarly to Figure 4.11 comparing the basic filters p10.m, p11.m, and p12.m. 
1 clear;set(0,'defaultaxesfontsize',20);format long
2 %%% p15.m Particle Filter (SIRS, OP), sin map (Ex. 1.3)
3
4 %% setup
5 J=1e3;% number of steps
6 alpha=2.5;% dynamics determined by alpha
7 gamma=1;% observational noise variance is gamma^2
8 sigma=3e-1;% dynamics noise variance is sigma^2
9 C0=9e-2;% prior initial condition variance
10 m0=0;% prior initial condition mean
11 sd=1;rng(sd);% choose random number seed
12 N=100;% number of ensemble members
13
14 m=zeros(J,1);v=m;y=m;c=m;U=zeros(J,N);% pre-allocate
15 v(1)=m0+sqrt(C0)*randn;% initial truth
16 m(1)=10*randn;% initial mean/estimate
17 c(1)=10*C0;H=1;% initial covariance and observation operator
18 U(1,:)=m(1)+sqrt(c(1))*randn(1,N);m(1)=sum(U(1,:))/N;% initial ensemble
19
20 %% solution % Assimilate!
21 for j=1:J
22   v(j+1)=alpha*sin(v(j)) + sigma*randn;% truth
23   y(j)=H*v(j+1)+gamma*randn;% observation
24
25   Sig=inv(inv(sigma^2)+H'*inv(gamma^2)*H);% optimal proposal covariance
26   em=Sig*(inv(sigma^2)*alpha*sin(U(j,:))+H'*inv(gamma^2)*y(j));% mean
27   Uhat=em+sqrt(Sig)*randn(1,N);% ensemble optimally importance sampled
28
29   d=y(j)-H*alpha*sin(U(j,:));% ensemble innovation
30   what=exp(-1/2/(sigma^2+gamma^2)*d.^2);% weight update
31   w=what/sum(what);% normalize predict weights
32
33   ws=cumsum(w);% resample: compute cdf of weights
34   for n=1:N
35     ix=find(ws>rand,1,'first');% resample: draw rand \sim U[0,1] and
36     % find the index of the particle corresponding to the first time
37     % the cdf of the weights exceeds rand.
38     U(j+1,n)=Uhat(ix);% resample: reset the nth particle to the one
39     % with the given index above
40   end
41
42   m(j+1)=sum(U(j+1,:))/N;% estimator update
43   c(j+1)=(U(j+1,:)-m(j+1))*(U(j+1,:)-m(j+1))'/N;% covariance update
44 end
45
46 js=21;% plot truth, mean, standard deviation, observations
47 figure;plot([0:js-1],v(1:js));hold;plot([0:js-1],m(1:js),'m');
48 plot([0:js-1],m(1:js)+sqrt(c(1:js)),'r--');plot([1:js-1],y(1:js-1),'kx');
49 plot([0:js-1],m(1:js)-sqrt(c(1:js)),'r--');hold;grid;
50 xlabel('iteration, j');title('Particle Filter (Optimal), Ex. 1.3');
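
The proposal and weight computations that distinguish p15.m from p14.m can be sketched in Python as follows. This is a hedged illustration for the scalar sin map only: the function name optimal_proposal_step, the list-based ensemble and the rng argument are assumptions made for this example, and resampling is left out.

```python
import math
import random


def optimal_proposal_step(U, y, alpha=2.5, sigma=0.3, gamma=1.0, H=1.0, rng=random):
    """Propose particles from the optimal proposal and compute their weights.

    U is the current ensemble (list of floats), y the next observation.
    Returns the proposed ensemble, normalized weights, proposal variance
    and the list of proposal means.
    """
    Sig = 1.0 / (1.0 / sigma**2 + H * H / gamma**2)      # proposal variance
    em = [Sig * (alpha * math.sin(u) / sigma**2 + H * y / gamma**2) for u in U]
    Uhat = [mu + math.sqrt(Sig) * rng.gauss(0.0, 1.0) for mu in em]
    # the weights depend on the previous particles, not on the proposed ones
    d = [y - H * alpha * math.sin(u) for u in U]
    s2 = H * sigma**2 * H + gamma**2                     # variance of y given v_j
    what = [math.exp(-0.5 * di**2 / s2) for di in d]
    total = sum(what)
    w = [x / total for x in what]                        # normalized weights
    return Uhat, w, Sig, em


if __name__ == "__main__":
    Uhat, w, Sig, em = optimal_proposal_step([0.0, 1.0, -2.0], 0.5,
                                             rng=random.Random(3))
    print(Sig, w)
```

Note that the proposal variance Sig is smaller than both sigma^2 and gamma^2, so the proposed particles are pulled towards values consistent with the new observation before any weighting takes place.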
5.4 ODE Programs 


The programs p16.m and p17.m are used to simulate and plot the Lorenz '63 and '96 models 
from Examples 2.6 and 2.7, respectively. These programs are both MATLAB functions, similar 
to the program p7.m presented in Section 5.2.5. The reason for using functions and not scripts 
is that the black box MATLAB built-in function ode45 can be used for the time integration 
(see help page for details regarding this function). Therefore, each has an auxiliary function 
defining the right-hand side of the given ODE, which is passed via function handle to ode45. 

5.4.1. p16.m 

The first of the ODE programs, p16.m, integrates the Lorenz '63 model. The setup section 
of the program, on lines 4-11, defines the parameters of the model and the initial conditions. 
In particular, a random Gaussian initial condition is chosen in line 9, and a small perturbation 
to its first (x) component is introduced in line 10. The trajectories are computed on lines 
13-14 using the built-in function ode45. Notice that the auxiliary function lorenz63, 
defined on line 29, takes as arguments (t,y), prescribed through the definition of the function 
handle @(t,y), while $(a, b, r)$ are given as fixed parameters (a,b,r), defining the particular 
instance of the function. The argument t is intended for defining non-autonomous ODEs, and 
is spurious here as the ODE is autonomous and therefore t does not appear on the right-hand 
side. It is nonetheless included for completeness, and causes no harm. The Euclidean norm 
of the error is computed in line 16, and the results are plotted similarly to previous programs 
in lines 18-25. This program is used to plot Figs. 2.6 and 2.7. 

5.4.2. p17.m 

The second of the ODE programs, p17.m, integrates the J=40 dimensional Lorenz '96 model. 
This program is almost identical to the previous one, where a small perturbation of the 
random Gaussian initial condition defined on line 9 is introduced on lines 10-11. The major 
difference is the function passed to ode45 on lines 14-15, which now defines the right-hand 
side of the Lorenz '96 model given by sub-function lorenz96 on line 30. Again the system is 
autonomous, and the spurious t variable is included for completeness. A few of the 40 degrees 
of freedom are plotted along with the error in lines 19-27. This program is used to plot Figs. 
EH and EH 
function this=p16
clear;set(0,'defaultaxesfontsize',20);format long
%%% p16.m Lorenz '63 (Ex. 2.6)
%% setup

a=10;b=8/3;r=28;% define parameters
sd=1;rng(sd);% choose random number seed

initial=randn(3,1);% choose initial condition
initial1=initial + [0.0001;0;0];% choose perturbed initial condition

%% calculate the trajectories with blackbox
[t1,y]=ode45(@(t,y) lorenz63(t,y,a,b,r), [0 100], initial);
[t,y1]=ode45(@(t,y) lorenz63(t,y,a,b,r), t1, initial1);

error=sqrt(sum((y-y1).^2,2));% calculate error

%% plot results

figure(1), semilogy(t,error,'k')
axis([0 100 10^-6 10^2])
set(gca,'YTick',[10^-6 10^-4 10^-2 10^0 10^2])

figure(2), plot(t,y(:,1),'k')
axis([0 100 -20 20])


%% auxiliary dynamics function definition
function rhs=lorenz63(t,y,a,b,r)

rhs(1,1)=a*(y(2)-y(1));
rhs(2,1)=-a*y(1)-y(2)-y(1)*y(3);
rhs(3,1)=y(1)*y(2)-b*y(3)-b*(r+a);
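
The experiment in p16.m, two trajectories started a distance 0.0001 apart, can be imitated without ode45. The Python sketch below swaps in a fixed-step classical Runge-Kutta (RK4) scheme for MATLAB's adaptive solver; that substitution, the step size h, and the names rk4_step and separation are all assumptions made for illustration, so the numbers it produces will not match the book's figures exactly. The right-hand side copies the shifted form of the Lorenz '63 equations used in the listing.

```python
def lorenz63(y, a=10.0, b=8.0 / 3.0, r=28.0):
    """Right-hand side of Lorenz '63 in the shifted form used by p16.m."""
    return [
        a * (y[1] - y[0]),
        -a * y[0] - y[1] - y[0] * y[2],
        y[0] * y[1] - b * y[2] - b * (r + a),
    ]


def rk4_step(f, y, h):
    """One classical fourth-order Runge-Kutta step of size h for y' = f(y)."""
    k1 = f(y)
    k2 = f([yi + 0.5 * h * ki for yi, ki in zip(y, k1)])
    k3 = f([yi + 0.5 * h * ki for yi, ki in zip(y, k2)])
    k4 = f([yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6.0 * (p + 2.0 * q + 2.0 * s + t)
            for yi, p, q, s, t in zip(y, k1, k2, k3, k4)]


def separation(y0, delta=1e-4, h=0.01, steps=2000):
    """Euclidean distance between two trajectories started delta apart."""
    y = list(y0)
    z = [y0[0] + delta] + list(y0[1:])     # perturb the first component only
    for _ in range(steps):
        y = rk4_step(lorenz63, y, h)
        z = rk4_step(lorenz63, z, h)
    return sum((p - q) ** 2 for p, q in zip(y, z)) ** 0.5


if __name__ == "__main__":
    for steps in (0, 500, 1000, 2000):
        print(steps, separation([1.0, 1.0, 1.0], steps=steps))
```

Printing the separation for increasing numbers of steps gives a rough numerical analogue of the semilogarithmic error plot produced by figure(1) in p16.m.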
function this=p17
clear;set(0,'defaultaxesfontsize',20);format long
%%% p17.m Lorenz '96 (Ex. 2.7)
%% setup

J=40;F=8;% define parameters
sd=1;rng(sd);% choose random number seed

initial=randn(J,1);% choose initial condition
initial1=initial;
initial1(1)=initial(1)+0.0001;% choose perturbed initial condition

%% calculate the trajectories with blackbox
[t1,y]=ode45(@(t,y) lorenz96(t,y,F), [0 100], initial);
[t,y1]=ode45(@(t,y) lorenz96(t,y,F), t1, initial1);

error=sqrt(sum((y-y1).^2,2));% calculate error

%% plot results

figure(1), plot(t,y(:,1),'k')
figure(2), plot(y(:,1),y(:,J),'k')
figure(3), plot(y(:,1),y(:,J-1),'k')

figure(4), semilogy(t,error,'k')
axis([0 100 10^-6 10^2])
set(gca,'YTick',[10^-6 10^-4 10^-2 10^0 10^2])

%% auxiliary dynamics function definition
function rhs=lorenz96(t,y,F)
rhs=[y(end);y(1:end-1)].*([y(2:end);y(1)] - ...
    [y(end-1:end);y(1:end-2)]) - y + F*y.^0;
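
The vectorized right-hand side in lorenz96 relies on cyclic shifts of the state vector. Written component by component it reads rhs(i) = y(i-1)*(y(i+1) - y(i-2)) - y(i) + F, with indices taken modulo J. The short Python sketch below spells this out; the function signature is our own assumption and drops the unused time argument.

```python
def lorenz96(y, F=8.0):
    """Cyclic Lorenz '96 right-hand side.

    Component i is y[i-1]*(y[i+1] - y[i-2]) - y[i] + F, with indices taken
    modulo J. Python's negative indices supply the wrap-around on the left;
    the modulo handles it on the right.
    """
    J = len(y)
    return [y[i - 1] * (y[(i + 1) % J] - y[i - 2]) - y[i] + F for i in range(J)]


if __name__ == "__main__":
    print(lorenz96([1.0, 2.0, 3.0, 4.0], F=0.0))
    print(max(abs(x) for x in lorenz96([8.0] * 40)))
```

The second call illustrates that the constant state y = F is an equilibrium of the model, since the quadratic term vanishes and the remaining terms cancel.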
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